A Controlled Study of Double DQN and Dueling DQN Under Cross-Environment Transfer
Transfer learning in deep reinforcement learning is often motivated by improved stability and reduced training cost, but it can also fail under substantial domain shift. This paper presents a controlled empirical study examining how architectural differences between Double Deep Q-Networks (DDQN) and Dueling DQN influence transfer behavior across environments. Using CartPole as a source task and LunarLander as a structurally distinct target task, we evaluate a fixed layer-wise representation transfer protocol under identical hyperparameters and training conditions, with baseline agents trained from scratch used to contextualize transfer effects. Empirical results show that DDQN consistently avoids negative transfer under this protocol and maintains learning dynamics comparable to those of the from-scratch baselines in the target environment. In contrast, Dueling DQN consistently exhibits negative transfer under identical conditions, characterized by degraded rewards and unstable optimization. Statistical analysis across multiple random seeds confirms a significant performance gap under transfer. These findings suggest that architectural inductive bias is strongly associated with robustness to cross-environment transfer in value-based deep reinforcement learning, at least under the examined transfer protocol.
💡 Research Summary
The paper presents a rigorously controlled empirical study that investigates how two popular architectural extensions of the Deep Q‑Network—Double DQN (DDQN) and Dueling DQN—behave when their learned representations are transferred from a simple source task (CartPole) to a substantially more complex target task (LunarLander‑v3). The authors deliberately fix every non‑architectural factor: the same three‑layer fully‑connected network (128‑128‑64 units), identical optimizer settings (Adam, with learning rates of 1e‑4 for CartPole and 5e‑4 for LunarLander), replay buffer size, batch size, epsilon‑greedy schedule, and soft‑update parameters. The transfer protocol is likewise held constant: only the first two hidden layers of the pre‑trained model are reused, the input and output layers are re‑initialized to match LunarLander's 8‑dimensional state and 4‑action spaces, and the transferred layers are frozen for the first 100 episodes of target‑task training before being unfrozen for fine‑tuning. Five random seeds are used for each configuration, results are reported as means with standard deviations, and statistical significance is assessed with two‑sample t‑tests at the 95% confidence level.
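The transfer protocol described above can be sketched in PyTorch. This is a hypothetical reconstruction, not the authors' code: layer sizes (128‑128‑64) follow the summary, and the choice of which two layers to copy (the shape‑compatible 128→128 and 128→64 layers, since the input layer must change with the state dimension) is an assumption.

```python
import torch
import torch.nn as nn

def build_qnet(state_dim: int, n_actions: int) -> nn.Sequential:
    """Three-hidden-layer MLP (128-128-64) as described in the summary."""
    return nn.Sequential(
        nn.Linear(state_dim, 128), nn.ReLU(),  # input layer (task-specific)
        nn.Linear(128, 128), nn.ReLU(),        # hidden layer (transferable)
        nn.Linear(128, 64), nn.ReLU(),         # hidden layer (transferable)
        nn.Linear(64, n_actions),              # output head (task-specific)
    )

def transfer(source: nn.Sequential, state_dim: int, n_actions: int,
             freeze: bool = True) -> nn.Sequential:
    """Reuse the two shape-compatible hidden layers; re-init input/output."""
    target = build_qnet(state_dim, n_actions)
    for idx in (2, 4):  # indices of the transferable Linear layers above
        target[idx].load_state_dict(source[idx].state_dict())
        if freeze:
            for p in target[idx].parameters():
                p.requires_grad = False  # unfrozen later for fine-tuning
    return target

source_net = build_qnet(state_dim=4, n_actions=2)            # CartPole
target_net = transfer(source_net, state_dim=8, n_actions=4)  # LunarLander
```

In this sketch the frozen layers would be unfrozen after the 100-episode warm-up by setting `requires_grad = True` on their parameters before continuing fine-tuning.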
In the source environment both DDQN and Dueling DQN achieve near‑optimal performance (average reward ≈ 200) within 150–200 episodes, confirming that the two architectures learn comparable representations under identical conditions. The transfer experiments, however, reveal a stark divergence. DDQN’s transferred agents start with a modest validation reward around –60 (due to the initial mismatch between source and target dynamics) but quickly recover, surpassing the LunarLander success threshold (≥ 200) after roughly 150 episodes of fine‑tuning. Their training loss remains low (mean‑squared TD error < 2) and exhibits smooth decay, indicating stable Q‑value updates throughout the process. By contrast, Dueling DQN’s transferred agents suffer a severe drop in validation reward to about –370 and never climb above 150, even after the full 500‑episode training budget. Their loss curve oscillates between 8 and 12, reflecting unstable learning and frequent divergence of the advantage stream. The final average validation reward difference between the two architectures is statistically significant (p < 0.01).
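The significance test reported above is a standard two-sample (Welch) t-test over per-seed final validation rewards. The sketch below shows the shape of that comparison; the reward values are illustrative placeholders chosen to resemble the reported magnitudes, not the paper's actual per-seed data.

```python
from scipy import stats

# Hypothetical per-seed final validation rewards (5 seeds per architecture);
# magnitudes loosely follow the summary (~200 for DDQN, ~-370 for Dueling).
ddqn_rewards = [212.0, 205.5, 198.7, 221.3, 209.9]
dueling_rewards = [-362.1, -388.4, -351.0, -375.6, -370.2]

# Welch's t-test (unequal variances), as is typical for small seed samples.
t_stat, p_value = stats.ttest_ind(ddqn_rewards, dueling_rewards,
                                  equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```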
The authors interpret these findings through the lens of each architecture’s inductive bias. DDQN’s decoupling of action selection (online network) from value evaluation (target network) mitigates the over‑estimation bias that often plagues vanilla DQN, and this bias‑reduction carries over when the high‑level feature extractor is reused in a new domain. Consequently, the transferred representation provides a relatively unbiased estimate of Q‑values even under the domain shift, allowing the agent to adapt smoothly. In contrast, Dueling DQN splits the Q‑function into a state‑value stream and an advantage stream. While this decomposition can improve sample efficiency in environments where the advantage varies little across actions, it also makes the network heavily reliant on the learned balance between V(s) and A(s,a). When the source task’s advantage structure (CartPole’s simple left/right balance) is transplanted to LunarLander’s more intricate thrust‑control dynamics, the advantage stream introduces a systematic mismatch that amplifies estimation errors. The frozen‑layer phase further locks in this mismatch, leading to the observed negative transfer and unstable loss.
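The two mechanisms contrasted in this paragraph are standard and can be stated compactly in code (function and variable names here are my own, not the paper's). Double DQN selects the next action with the online network but evaluates it with the target network; the dueling head recombines its streams as Q(s,a) = V(s) + A(s,a) − mean_a A(s,a), where subtracting the mean keeps V and A identifiable.

```python
import torch

def double_dqn_target(reward: float,
                      next_q_online: torch.Tensor,
                      next_q_target: torch.Tensor,
                      gamma: float = 0.99,
                      done: bool = False) -> torch.Tensor:
    """Bootstrap target: online net picks the action, target net scores it."""
    a_star = next_q_online.argmax()        # action selection (online network)
    bootstrap = next_q_target[a_star]      # value evaluation (target network)
    return torch.as_tensor(reward) + (0.0 if done else gamma * bootstrap)

def dueling_q(value: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Dueling aggregation with the mean-advantage baseline."""
    return value + advantages - advantages.mean()
```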
The paper acknowledges several limitations. The transfer protocol is intentionally naïve—only two hidden layers are reused, and no sophisticated domain‑adaptation techniques (e.g., adversarial alignment, meta‑learning of initializations, or progressive unfreezing) are explored. Consequently, the results reflect a worst‑case scenario for each architecture rather than the best possible transfer performance. Moreover, the study is confined to a single source‑target pair; while CartPole→LunarLander provides a clear domain shift, broader generalization would require additional environments (e.g., Atari games, continuous‑control MuJoCo tasks). Finally, the authors do not compare against hybrid architectures such as Rainbow, which combine both Double Q‑learning and dueling streams, leaving open the question of whether the combination could inherit the robustness of DDQN while still benefiting from the representational efficiency of the dueling head.
Despite these constraints, the work makes a clear and valuable contribution: it isolates the effect of architectural inductive bias on transfer learning in value‑based deep RL, demonstrating that DDQN’s bias‑reduction mechanism confers robustness to naïve representation transfer, whereas the dueling decomposition can become a liability under substantial domain shift. The findings suggest that practitioners should carefully consider the underlying architecture when designing transfer pipelines, especially in settings where source and target tasks differ markedly in dynamics or reward sparsity. Future research directions proposed include (1) evaluating a broader suite of source‑target pairs, (2) testing progressive unfreezing schedules and regularization strategies to mitigate negative transfer in dueling networks, and (3) integrating the two architectures within a unified framework (e.g., Rainbow) to assess whether combined inductive biases can yield more universally transferable representations.
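Of the proposed directions, progressive unfreezing is the most mechanical to specify: instead of unfreezing all transferred layers at a single episode, layers are released one at a time on a schedule. A minimal sketch, with episode thresholds chosen purely for illustration:

```python
def unfreeze_plan(num_frozen_layers: int, episode: int,
                  start: int = 100, step: int = 100) -> int:
    """Number of transferred layers to unfreeze at `episode`.

    No layers are trainable before `start`; afterwards one additional
    layer is released every `step` episodes (thresholds are illustrative,
    not values from the paper).
    """
    if episode < start:
        return 0
    return min(num_frozen_layers, 1 + (episode - start) // step)
```

In a training loop, the returned count would determine how many of the transferred layers (ordered, e.g., from deepest to shallowest) have `requires_grad` enabled at that episode.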