Adapting the Behavior of Reinforcement Learning Agents to Changing Action Spaces and Reward Functions
Reinforcement Learning (RL) agents often struggle in real-world applications where environmental conditions are non-stationary, particularly when reward functions shift or the available action space expands. This paper introduces MORPHIN, a self-adaptive Q-learning framework that enables on-the-fly adaptation without full retraining. By integrating concept drift detection with dynamic adjustments to learning and exploration hyperparameters, MORPHIN adapts agents to both shifts in the reward function and run-time expansions of the action space, while preserving prior policy knowledge to prevent catastrophic forgetting. We validate our approach using a Gridworld benchmark and a traffic signal control simulation. The results demonstrate that MORPHIN achieves superior convergence speed and continuous adaptation compared to a standard Q-learning baseline, improving learning efficiency by up to 1.7x.
💡 Research Summary
Reinforcement learning (RL) agents are typically designed under the assumption of a stationary Markov Decision Process (MDP). In many real‑world settings, however, the reward function can shift and new actions may become available, breaking the stationarity assumption and causing standard Q‑learning agents to perform poorly or even diverge. This paper introduces MORPHIN, a self‑adaptive Q‑learning framework that detects such non‑stationarities on‑line and automatically adjusts both the exploration rate (ε) and the learning rate (α) to accommodate them without discarding previously learned knowledge.
The core of MORPHIN is a two‑stage adaptation loop. First, a Page‑Hinkley (PH) test monitors the cumulative episode reward R_ep after each episode. When the cumulative deviation exceeds a user‑defined threshold H (controlled by a sensitivity parameter δ), a drift is declared. Upon drift detection, MORPHIN resets an exploration counter e to zero, which forces ε to its maximum value (≈1) through an exponential decay schedule ε_t = ε_min + (1‑ε_min)·exp(‑decay_rate·e). This immediate re‑exploration guarantees that the agent samples the altered reward landscape or newly introduced actions sufficiently.
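The first stage described above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the Page‑Hinkley update rule follows the standard textbook form, and the parameter values (δ, H, ε_min, decay rate) are placeholder assumptions.

```python
import math

class PageHinkley:
    """Standard Page-Hinkley test for a downward shift in episode reward.

    delta is the tolerated deviation (sensitivity), H the detection
    threshold; both are illustrative defaults, not the paper's values.
    """
    def __init__(self, delta=0.05, H=50.0):
        self.delta = delta
        self.H = H
        self.mean = 0.0      # running mean of episode rewards
        self.n = 0
        self.cum = 0.0       # cumulative deviation m_t
        self.min_cum = 0.0   # running minimum M_t

    def update(self, r_ep):
        """Feed one episode reward R_ep; return True if drift is declared."""
        self.n += 1
        self.mean += (r_ep - self.mean) / self.n
        # Drops in reward relative to the running mean push cum upward;
        # delta absorbs small fluctuations.
        self.cum += self.mean - r_ep - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return (self.cum - self.min_cum) > self.H


def epsilon(e, eps_min=0.05, decay_rate=0.01):
    """Exploration schedule from the summary:
    eps_t = eps_min + (1 - eps_min) * exp(-decay_rate * e)."""
    return eps_min + (1.0 - eps_min) * math.exp(-decay_rate * e)
```

Resetting the exploration counter `e` to zero on drift yields `epsilon(0) == 1.0`, i.e. fully random action selection, after which ε decays back toward `eps_min` as episodes accumulate.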
Second, MORPHIN computes the temporal‑difference (TD) error δ_TD = r_{t+1} + γ·max_a Q(s_{t+1},a) – Q(s_t,a_t). Large absolute TD errors indicate a mismatch between the current Q‑values and the new environment dynamics. MORPHIN translates this signal into a dynamic learning rate α* using a sigmoid‑shaped function of the TD‑error magnitude, α* = α + (α_max – α)·σ(|δ_TD|), so that large errors push the learning rate toward α_max and accelerate the incorporation of new experience, while small errors keep updates conservative and protect previously learned Q‑values.
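The second stage can be sketched as follows. The summary only states that α* is a sigmoid‑shaped function of the TD error bounded by α and α_max; the specific rescaled‑logistic form and the gain parameter `k` below are assumptions made for illustration.

```python
import math

def td_error(r_next, gamma, q_next_max, q_sa):
    """TD error: delta_TD = r_{t+1} + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)."""
    return r_next + gamma * q_next_max - q_sa

def adaptive_alpha(delta_td, alpha=0.1, alpha_max=0.9, k=1.0):
    """Map |delta_TD| into [alpha, alpha_max] with a sigmoid-shaped curve.

    The rescaled logistic below is 0 at |delta_TD| = 0 and approaches 1
    as the error grows; k (assumed) controls how quickly alpha* saturates.
    """
    s = 2.0 / (1.0 + math.exp(-k * abs(delta_td))) - 1.0
    return alpha + (alpha_max - alpha) * s
```

With this shape, a near-zero TD error leaves the base rate α untouched, while a large surprise (e.g. after a reward shift) drives α* toward α_max for fast re-learning.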