Hybrid Differential Reward: Combining Temporal Difference and Action Gradients for Efficient Multi-Agent Reinforcement Learning in Cooperative Driving

Reading time: 6 minutes

📝 Abstract

In multi-vehicle cooperative driving tasks involving high-frequency continuous control, traditional state-based reward functions suffer from the issue of vanishing reward differences. This phenomenon results in a low signal-to-noise ratio (SNR) for policy gradients, significantly hindering algorithm convergence and performance improvement. To address this challenge, this paper proposes a novel Hybrid Differential Reward (HDR) mechanism. We first theoretically elucidate how the temporal quasi-steady nature of traffic states and the physical proximity of actions lead to the failure of traditional reward signals. Building on this analysis, the HDR framework innovatively integrates two complementary components: (1) a Temporal Difference Reward (TDR) based on a global potential function, which utilizes the evolutionary trend of potential energy to ensure optimal policy invariance and consistency with long-term objectives; and (2) an Action Gradient Reward (AGR), which directly measures the marginal utility of actions to provide a local guidance signal with a high SNR. Furthermore, we formulate the cooperative driving problem as a Multi-Agent Partially Observable Markov Game (POMDPG) with a time-varying agent set and provide a complete instantiation scheme for HDR within this framework. Extensive experiments conducted using both online planning (MCTS) and Multi-Agent Reinforcement Learning (QMIX, MAPPO, MADDPG) algorithms demonstrate that the HDR mechanism significantly improves convergence speed and policy stability. The results confirm that HDR guides agents to learn high-quality cooperative policies that effectively balance traffic efficiency and safety.


📄 Content

Hybrid Differential Reward: Combining Temporal Difference and Action Gradients for Efficient Multi-Agent Reinforcement Learning in Cooperative Driving

Ye Han, Lijun Zhang∗, Dejian Meng, Zhuang Zhang

Abstract—In multi-vehicle cooperative driving tasks involving high-frequency continuous control, traditional state-based reward functions suffer from the issue of vanishing reward differences. This phenomenon results in a low signal-to-noise ratio (SNR) for policy gradients, significantly hindering algorithm convergence and performance improvement. To address this challenge, this paper proposes a novel Hybrid Differential Reward (HDR) mechanism. We first theoretically elucidate how the temporal quasi-steady nature of traffic states and the physical proximity of actions lead to the failure of traditional reward signals. Building on this analysis, the HDR framework innovatively integrates two complementary components: (1) a Temporal Difference Reward (TDR) based on a global potential function, which utilizes the evolutionary trend of potential energy to ensure optimal policy invariance and consistency with long-term objectives; and (2) an Action Gradient Reward (AGR), which directly measures the marginal utility of actions to provide a local guidance signal with a high SNR. Furthermore, we formulate the cooperative driving problem as a Multi-Agent Partially Observable Markov Game (POMDPG) with a time-varying agent set and provide a complete instantiation scheme for HDR within this framework. Extensive experiments conducted using both online planning (MCTS) and Multi-Agent Reinforcement Learning (QMIX, MAPPO, MADDPG) algorithms demonstrate that the HDR mechanism significantly improves convergence speed and policy stability. The results confirm that HDR guides agents to learn high-quality cooperative policies that effectively balance traffic efficiency and safety.
I. INTRODUCTION

As autonomous driving technology evolves from single-vehicle intelligence to collective intelligence, cooperative driving has emerged as a critical approach to mitigating traffic congestion and enhancing road safety [1]. Unlike the reactive decision-making of single-vehicle systems, cooperative driving requires multiple agents to engage in strategic interactions and game-theoretic reasoning to seek joint solutions that maximize collective benefits. Multi-Agent Reinforcement Learning (MARL) is widely regarded as an ideal paradigm for achieving this goal due to its capability to handle high-dimensional state spaces and complex game dynamics [2], [3].

However, despite the significant success of MARL in discrete games such as StarCraft and Go, its application to high-frequency, continuous decision-making tasks—specifically vehicle control—faces a critical yet long-overlooked challenge: the problem of vanishing reward differences [4], [5].

(Ye Han, Lijun Zhang, Dejian Meng, and Zhuang Zhang are with the School of Automotive Studies, Tongji University, Shanghai 201804, China. {hanye_leohancnjs, tjedu_zhanglijun, mengdejian, zhuang_zhang}@tongji.edu.cn. ∗Corresponding author: Lijun Zhang.)

The root of this problem lies in the inherent conflict between the temporal quasi-steady nature of the physical traffic system and the high-frequency characteristics of decision-making algorithms. Within a short decision interval (e.g., 0.1 seconds), vehicle inertia dictates that macroscopic states, such as position and velocity, undergo only negligible evolution. Traditional state-based reward functions fail to capture these subtle changes, causing the differences in reward signals generated by distinct actions (e.g., rapid acceleration versus gradual acceleration) to be overwhelmed by environmental noise.
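The effect described above can be made concrete with a small numeric sketch. This is not from the paper: the linear speed-based reward and the noise magnitude are illustrative assumptions, chosen only to show how little a 0.1 s interval separates two quite different accelerations.

```python
# Illustrative sketch (assumed reward form, not the paper's): why a
# state-based reward barely distinguishes actions over a 0.1 s interval.
dt = 0.1                    # decision interval (s)
v0 = 10.0                   # current speed (m/s)
a_fast, a_slow = 2.0, 0.5   # two candidate accelerations (m/s^2)

def reward(v):
    # A simple state-based reward proportional to speed (arbitrary scale).
    return 0.1 * v

v_fast = v0 + a_fast * dt   # speed after rapid acceleration
v_slow = v0 + a_slow * dt   # speed after gradual acceleration

signal = reward(v_fast) - reward(v_slow)  # difference the policy must detect
noise_std = 0.5             # assumed reward noise from the rest of traffic

print(f"reward difference: {signal:.4f}")
print(f"signal-to-noise ratio: {signal / noise_std:.3f}")
```

Under these assumed numbers, the reward gap between "rapid" and "gradual" acceleration is 0.015 against noise of standard deviation 0.5, i.e. an SNR of roughly 0.03, which is the low-gradient-SNR regime the paper describes.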
This results in an excessively low Signal-to-Noise Ratio (SNR) for policy gradients, leading to issues such as vanishing gradients or excessive variance during the early stages of training. Consequently, algorithms struggle to converge to complex cooperative policies.

Existing solutions primarily focus on reward shaping [6] or intrinsic motivation [7]. Although classic Potential-Based Reward Shaping (PBRS) [6] theoretically guarantees optimal policy invariance, it fundamentally relies on state differences (ϕ(s′) − ϕ(s)). When state changes are negligible, the guidance signal provided by PBRS remains equally weak, failing to fundamentally resolve the issue of low gradient SNR. In other words, existing methods address the directionality of rewards but fail to resolve the issue of reward sensitivity.

To overcome this bottleneck, this paper proposes a novel Hybrid Differential Reward (HDR) mechanism. HDR moves beyond reliance solely on state evaluation by innovatively introducing the dimension of action gradients. It integrates two complementary signals: (1) a Temporal Difference Reward (TDR) based on a potential function, which utilizes state evolution trends to ensure consistency with long-term optimization objectives and theoretical completeness;
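The two components can be sketched in code. This is a minimal, hedged reading of the text, not the paper's implementation: the potential function `phi`, the finite-difference form of the action-gradient term, the toy `utility` callable, and the mixing weights `w_t`/`w_a` are all illustrative assumptions.

```python
# Hedged sketch of the two HDR components as described in the text.
# phi, utility, and the weights are assumptions for illustration only.
def phi(state):
    # Assumed potential: negative distance to goal (higher is better).
    return -abs(state["dist_to_goal"])

def tdr(state, next_state, gamma=0.99):
    # Temporal Difference Reward: the standard potential-based shaping
    # term gamma * phi(s') - phi(s), which preserves optimal policies.
    return gamma * phi(next_state) - phi(state)

def agr(utility, action, eps=1e-3):
    # Action Gradient Reward: central-difference estimate of the marginal
    # utility of the chosen action (utility is an assumed callable).
    return (utility(action + eps) - utility(action - eps)) / (2 * eps)

def hdr(state, next_state, utility, action, w_t=1.0, w_a=0.5):
    # Hybrid Differential Reward: weighted sum of the global trend signal
    # and the local action-gradient signal (weights assumed).
    return w_t * tdr(state, next_state) + w_a * agr(utility, action)

s, s2 = {"dist_to_goal": 50.0}, {"dist_to_goal": 49.0}
u = lambda a: -(a - 1.0) ** 2   # toy utility peaking at action = 1.0
print(hdr(s, s2, u, action=0.5))
```

The key property this sketch illustrates: even when `phi(s2) - phi(s)` is tiny (quasi-steady states), the `agr` term still separates nearby actions, which is the sensitivity the text argues PBRS alone lacks.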

This content is AI-processed based on ArXiv data.
