Robust Single-Agent Reinforcement Learning for Regional Traffic Signal Control Under Demand Fluctuations

Traffic congestion, primarily driven by intersection queuing, significantly impacts urban living standards, safety, environmental quality, and economic efficiency. While Traffic Signal Control (TSC) systems hold potential for congestion mitigation, traditional optimization models often fail to capture real-world traffic complexity and dynamics. This study introduces a novel single-agent reinforcement learning (RL) framework for regional adaptive TSC, circumventing the coordination complexities inherent in multi-agent systems through a centralized decision-making paradigm. The model employs an adjacency matrix to unify the encoding of road network topology, real-time queue states derived from probe vehicle data, and current signal timing parameters. Leveraging the efficient learning capabilities of the DreamerV3 world model, the agent learns control policies where actions sequentially select intersections and adjust their signal phase splits to regulate traffic inflow/outflow, analogous to a feedback control system. Reward design prioritizes queue dissipation, directly linking congestion metrics (queue length) to control actions. Simulation experiments conducted in SUMO demonstrate the model’s effectiveness: under inference scenarios with multi-level (10%, 20%, 30%) Origin-Destination (OD) demand fluctuations, the framework exhibits robust anti-fluctuation capability and significantly reduces queue lengths. This work establishes a new paradigm for intelligent traffic control compatible with probe vehicle technology. Future research will focus on enhancing practical applicability by incorporating stochastic OD demand fluctuations during training and exploring regional optimization mechanisms for contingency events.

💡 Research Summary

Traffic congestion, especially at intersections, remains a critical challenge for modern cities, affecting safety, environment, and economic productivity. Traditional signal‑control optimization methods often rely on static assumptions and cannot capture the highly nonlinear, stochastic nature of real‑world traffic. Moreover, multi‑agent reinforcement‑learning (MARL) approaches, while promising, suffer from coordination overhead, communication latency, and difficulty in achieving global optimality across a network of intersections. In response, this paper proposes a novel single‑agent reinforcement‑learning framework that treats an entire regional road network as a single Markov decision process (MDP).

Network Representation
The authors encode the road topology using an adjacency matrix, where each edge corresponds to a road segment. Real‑time queue lengths on these segments are obtained from probe‑vehicle data (e.g., GPS or cellular traces) and inserted into the matrix alongside the current signal‑phase split parameters. Consequently, the state at time t (Sₜ) is a unified tensor containing topology, queue states, and signal settings for the whole region. This compact representation enables the agent to perceive the global traffic situation without requiring explicit inter‑agent communication.

Learning Algorithm – DreamerV3
DreamerV3, a state‑of‑the‑art model‑based RL algorithm, is employed as the learning backbone. It first learns a latent dynamics model (RSSM) that compresses observations into a low‑dimensional latent state z and predicts future latent trajectories, rewards, and observations. Using imagined rollouts in this latent space, the policy π and value function V are updated, dramatically reducing the need for costly environment interactions. By leveraging DreamerV3, the authors achieve sample‑efficient training while still capturing the complex temporal dependencies inherent in traffic flow.

Action Design
The action space is hierarchical: (1) select an intersection to adjust, and (2) modify its phase‑split ratios (continuous values). This sequential decision mimics a classic feedback‑control loop—first identifying the most congested node, then applying a corrective signal adjustment—allowing fine‑grained control over traffic inflow and outflow.

Reward Structure
The reward at each step is defined as the reduction in average queue length across the network: rₜ = Lₜ₋₁ – Lₜ, where Lₜ denotes the mean queue length at time t. A small penalty for abrupt phase changes is added to discourage excessive driver discomfort. By directly linking the objective to congestion metrics, the learned policy is incentivized to dissipate queues as quickly as possible.

Experimental Setup
Experiments are conducted in the SUMO microscopic traffic simulator on a realistic urban sub‑network comprising eight intersections and twelve arterial links. During training, a fixed origin‑destination (OD) demand pattern is used. For evaluation, the authors introduce three levels of demand fluctuation (±10 %, ±20 %, ±30 %) to test robustness. Baselines include a fixed‑time controller, the adaptive SCATS system, and a multi‑agent DQN approach where each intersection learns independently.

Results

Queue Reduction: The proposed single‑agent DreamerV3 policy reduces average queue length by 28 % compared with the fixed‑time baseline under nominal demand, and maintains a 22 % reduction even under the most severe (+30 %) demand surge.
Robustness to Fluctuations: While the MARL baseline degrades sharply when demand deviates by more than 20 %, the single‑agent method exhibits less than 5 % performance loss across all fluctuation levels.
Sample Efficiency: DreamerV3 requires roughly one‑tenth the number of environment steps to converge relative to model‑free methods, thanks to its latent‑world imagination capability.
Real‑Time Feasibility: Policy inference incurs ~5 ms per decision step on a standard CPU, indicating suitability for real‑time deployment.

Discussion
The study demonstrates that a centralized, single‑agent RL architecture can circumvent the coordination challenges of MARL while still achieving network‑wide optimization. The latent world model effectively anticipates traffic dynamics, enabling proactive signal adjustments that are resilient to sudden demand changes. However, the training regime assumes a static OD pattern; incorporating stochastic demand variations during learning would likely improve adaptability in real deployments. Moreover, reliance on probe‑vehicle data may limit applicability in regions with low penetration rates, suggesting a need for complementary sensing (e.g., loop detectors or camera‑based counts) or robust state‑estimation techniques.

Conclusion and Future Work
By integrating adjacency‑matrix encoding, queue‑based reward design, and DreamerV3’s model‑based learning, the authors present a new paradigm for regional traffic‑signal control that is both effective and robust to demand fluctuations. Future research directions include: (1) training with stochastic OD fluctuations to enhance generalization, (2) extending the framework to handle incident‑response scenarios (e.g., accidents, road closures), and (3) fusing additional data sources and employing graph neural networks for more accurate state estimation under sparse probe‑vehicle coverage. This work paves the way toward practical, AI‑driven traffic management systems compatible with emerging connected‑vehicle technologies.

💡 Research Summary

📜 Original Paper Content