RAST-MoE-RL: A Regime-Aware Spatio-Temporal MoE Framework for Deep Reinforcement Learning in Ride-Hailing


Ride-hailing platforms face the challenge of balancing passenger waiting times with overall system efficiency under highly uncertain supply-demand conditions. Adaptive delayed matching creates a trade-off between matching and pickup delays by deciding whether to assign drivers immediately or batch requests. Since outcomes accumulate over long horizons with stochastic dynamics, reinforcement learning (RL) is a suitable framework. However, existing approaches often oversimplify traffic dynamics or use shallow encoders that miss complex spatiotemporal patterns. We introduce the Regime-Aware Spatio-Temporal Mixture-of-Experts (RAST-MoE), which formalizes adaptive delayed matching as a regime-aware MDP equipped with a self-attention MoE encoder. Unlike monolithic networks, our experts specialize automatically, improving representation capacity while maintaining computational efficiency. A physics-informed congestion surrogate preserves realistic density-speed feedback, enabling millions of efficient rollouts, while an adaptive reward scheme guards against pathological strategies. With only 12M parameters, our framework outperforms strong baselines. On real-world Uber trajectory data (San Francisco), it improves total reward by over 13%, reducing average matching and pickup delays by 10% and 15% respectively. It demonstrates robustness across unseen demand regimes and stable training. These findings highlight the potential of MoE-enhanced RL for large-scale decision-making with complex spatiotemporal dynamics.


💡 Research Summary

This paper addresses a critical operational challenge in ride-hailing platforms known as “adaptive delayed matching”: the decision of whether to match a driver to a passenger request immediately or to hold the request for batching in pursuit of better system-wide efficiency. This creates a fundamental trade-off between matching delay (waiting time until assignment) and pickup delay (travel time after assignment). The problem is characterized by long horizons, stochastic demand and supply, and complex spatiotemporal dynamics, making Reinforcement Learning (RL) a suitable framework.

The authors identify key limitations in prior RL approaches: oversimplified traffic dynamics that misrepresent congestion, and shallow policy encoders that fail to capture complex spatiotemporal patterns. To overcome these, the paper makes three core contributions:

  1. Formulation of a Regime-Aware Spatio-Temporal MDP (RAST-MDP): The problem is formalized as an MDP whose state space explicitly captures heterogeneous “regimes” – distinct patterns of supply-demand imbalance that vary by time of day (e.g., morning peak, off-peak) and location. The action is a binary vector per zone (“match now” or “hold”), creating a large combinatorial action space. The reward function is carefully designed to balance incremental matching and pickup delay costs while incorporating an adaptive penalty for service-quality violations (e.g., too many excessively delayed pickups). This adaptive component uses an online Lagrange multiplier to dynamically adjust the penalty strength, effectively preventing the RL agent from developing pathological “reward-hacking” strategies like indefinitely holding requests.

  2. A Physics-Informed, Scalable Travel Time Surrogate Model: Recognizing that microscopic traffic simulation is too costly for millions of RL rollouts, the authors develop an efficient surrogate model based on Macroscopic Fundamental Diagrams (MFD). It aggregates traffic flow data into zone-hour level average speeds, preserving the essential density-speed feedback (congestion slows speed). Travel times for all origin-destination pairs are precomputed offline for each hour of the day, allowing for O(1) queries during RL training. This provides realistic congestion signals while maintaining scalability.

  3. The RAST-MoE-RL Framework: The central innovation is the Regime-Aware Spatio-Temporal Mixture-of-Experts (RAST-MoE) encoder. This replaces the standard shared feature extractor in an actor-critic RL setup (using PPO as the primary trainer). The encoder first tokenizes the state of each spatial zone using a lightweight transformer. Then, a MoE layer processes a pooled global state representation. The MoE layer consists of multiple expert neural networks and a router. For each input, the router selects only the top-K most relevant experts (e.g., top-2), whose outputs are combined. This design allows different experts to automatically specialize in handling different demand-supply regimes (e.g., one expert for peak congestion, another for sparse demand), dramatically increasing the model’s representational capacity without a proportional increase in computational cost per sample. The framework remains compact, with only 12M parameters.
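The adaptive penalty described in contribution 1 amounts to an online dual-ascent update on a Lagrange multiplier: the penalty weight grows while service-quality violations exceed a budget and decays back toward zero when the policy behaves. A minimal sketch follows; the violation threshold, step size, and penalty form are all illustrative assumptions, since the summary does not give the paper's exact formulation:

```python
# Hypothetical sketch of an online Lagrange-multiplier penalty, assuming the
# constraint is a cap on the fraction of excessively delayed pickups.

class AdaptivePenalty:
    def __init__(self, threshold=0.05, lr=0.01):
        self.threshold = threshold  # allowed violation rate (assumed value)
        self.lr = lr                # dual-ascent step size (assumed value)
        self.lam = 0.0              # Lagrange multiplier, kept non-negative

    def penalized_reward(self, base_reward, violation_rate):
        # Dual ascent: raise lambda while violations exceed the budget,
        # shrink it toward zero once the policy is within budget. This is
        # what removes the incentive to "reward-hack" by holding requests
        # indefinitely: sustained violations make the penalty unbounded.
        self.lam = max(0.0, self.lam + self.lr * (violation_rate - self.threshold))
        return base_reward - self.lam * violation_rate

pen = AdaptivePenalty()
r = pen.penalized_reward(base_reward=1.0, violation_rate=0.2)  # lambda rises
```

Because the multiplier adapts online, the agent cannot settle into a strategy that trades a fixed penalty for systematic violations.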
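The surrogate in contribution 2 can be pictured as an offline precomputation followed by a constant-time table lookup during rollouts. The sketch below uses a Greenshields-style linear speed-density relation as a stand-in MFD; the zone counts, distances, and densities are random placeholders, not the paper's calibrated model:

```python
import numpy as np

N_ZONES, HOURS = 5, 24          # toy sizes; the real zone partition is not given here
rng = np.random.default_rng(1)

def mfd_speed(density, v_free=50.0, k_jam=120.0):
    """Greenshields-style MFD: speed falls linearly with density (km/h),
    floored to avoid division by zero. Parameters are assumptions."""
    return max(v_free * (1.0 - density / k_jam), 5.0)

# Offline stage: hourly zone densities -> speeds -> OD travel times (minutes).
density = rng.uniform(10, 100, size=(HOURS, N_ZONES))       # veh/km, placeholder
distance_km = rng.uniform(1, 15, size=(N_ZONES, N_ZONES))   # OD distances, placeholder
travel_time = np.empty((HOURS, N_ZONES, N_ZONES))
for h in range(HOURS):
    for o in range(N_ZONES):
        travel_time[h, o] = distance_km[o] / mfd_speed(density[h, o]) * 60

def query(hour, origin, dest):
    """O(1) lookup used inside the RL rollout loop."""
    return travel_time[hour, origin, dest]
```

The density-speed feedback lives entirely in the offline stage, so millions of rollouts only ever pay the cost of an array index.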

The proposed method is evaluated on real-world Uber trajectory data from San Francisco. Despite its modest size, RAST-MoE-RL consistently outperforms strong baselines. It improves the total reward by over 13%, while reducing the average matching delay by 10% and the average pickup delay by 15%. The framework demonstrates robustness when evaluated under unseen demand regimes, exhibits stable training without reward hacking, and analysis confirms that different MoE experts do specialize in distinct operating regimes. The findings highlight the significant potential of combining MoE architectures with RL for large-scale decision-making problems involving complex spatiotemporal dynamics and large discrete action spaces.
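The top-K routing at the heart of the MoE encoder can be sketched in a few lines. This is a toy illustration, not the paper's architecture: the expert count, feature dimensions, and single-linear-layer "experts" are all assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K = 4, 2     # assumed sizes (the summary mentions top-2 routing)
D_IN, D_OUT = 8, 8

# Each "expert" is reduced to one linear map here; in practice each would
# be a small neural network.
expert_weights = [rng.standard_normal((D_IN, D_OUT)) * 0.1 for _ in range(N_EXPERTS)]
router_weights = rng.standard_normal((D_IN, N_EXPERTS)) * 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x):
    """Sparse top-K MoE: evaluate only the K highest-scoring experts."""
    logits = x @ router_weights            # router score per expert
    top_idx = np.argsort(logits)[-TOP_K:]  # indices of the top-K experts
    gates = softmax(logits[top_idx])       # renormalize gates over the selection
    # Only the selected experts run, so compute per sample stays roughly
    # constant even as the total expert count (and capacity) grows.
    return sum(g * (x @ expert_weights[i]) for g, i in zip(gates, top_idx))

y = moe_forward(rng.standard_normal(D_IN))
print(y.shape)  # (8,)
```

Because different inputs activate different expert subsets, specialization (e.g., one expert for peak congestion, another for sparse demand) can emerge without any explicit regime labels.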

