Found-RL: Foundation Model-Enhanced Reinforcement Learning for Autonomous Driving
Reinforcement Learning (RL) has emerged as a dominant paradigm for end-to-end autonomous driving (AD). However, RL suffers from sample inefficiency and a lack of semantic interpretability in complex scenarios. Foundation Models, particularly Vision-Language Models (VLMs), can mitigate this by offering rich, context-aware knowledge, yet their high inference latency hinders deployment in high-frequency RL training loops. To bridge this gap, we present Found-RL, a platform tailored to efficiently enhance RL for AD using foundation models. A core innovation is the asynchronous batch inference framework, which decouples heavy VLM reasoning from the simulation loop, effectively resolving latency bottlenecks to support real-time learning. We introduce two supervision mechanisms, Value-Margin Regularization (VMR) and Advantage-Weighted Action Guidance (AWAG), to effectively distill expert-like VLM action suggestions into the RL policy. Additionally, we adopt high-throughput CLIP for dense reward shaping. We address CLIP’s dynamic blindness via Conditional Contrastive Action Alignment, which conditions prompts on discretized speed/command and yields a normalized, margin-based bonus from context-specific action-anchor scoring. Found-RL provides an end-to-end pipeline for fine-tuned VLM integration and shows that a lightweight RL model can approach the performance of billion-parameter VLMs while sustaining real-time inference (approx. 500 FPS). Code, data, and models will be publicly available at https://github.com/ys-qu/found-rl.
💡 Research Summary
Found‑RL introduces a comprehensive platform that integrates large‑scale foundation models—specifically vision‑language models (VLMs) and CLIP—into reinforcement‑learning (RL) pipelines for autonomous driving. The authors identify two persistent challenges in RL‑based driving: (1) extreme sample inefficiency due to sparse rewards and (2) limited semantic interpretability in complex traffic scenarios. While VLMs can provide rich, context‑aware supervision, their computational cost makes direct, per‑step inference infeasible for high‑frequency training loops.
The core technical contribution is an asynchronous batch inference framework. During each simulation step, rollout workers convert the current observation (e.g., bird's-eye-view images, masks) and lightweight metadata (speed, traffic-light state, route command) into a textual prompt. These prompts are placed into a shared request queue. A separate inference server continuously pulls requests, groups them into micro-batches based on a size cap and a short timeout, and runs the VLM or CLIP in parallel on GPU/TPU resources. The server returns per-request outputs—expert-like action suggestions, availability flags, or CLIP similarity scores—through an output queue. Because the simulation never blocks waiting for model responses, the overall training loop sustains real-time performance (≈500 FPS) even when using multi-billion-parameter VLMs.
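The queueing pattern described above can be sketched with Python's standard library. This is a minimal illustration, not the paper's implementation: `AsyncBatchServer`, `model_fn`, and the parameter names are all hypothetical stand-ins, with a callable replacing the actual VLM/CLIP batch call.

```python
import queue
import threading
import time

class AsyncBatchServer:
    """Sketch of the asynchronous batch-inference loop: rollout workers
    enqueue prompts and continue stepping; a server thread groups pending
    requests into micro-batches (size cap or timeout, whichever first)
    and runs one batched model call per group."""

    def __init__(self, model_fn, max_batch=8, timeout_s=0.01):
        self.requests = queue.Queue()     # shared request queue
        self.model_fn = model_fn          # stand-in for the VLM/CLIP call
        self.max_batch = max_batch        # micro-batch size cap
        self.timeout_s = timeout_s        # short batching timeout
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._serve, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def submit(self, prompt):
        """Called by a rollout worker: returns a per-request output queue
        immediately, so the simulation step never blocks on inference."""
        out = queue.Queue(maxsize=1)
        self.requests.put((prompt, out))
        return out

    def _serve(self):
        while not self._stop.is_set():
            batch = []
            deadline = time.monotonic() + self.timeout_s
            # Collect until the size cap is hit or the timeout expires.
            while len(batch) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            if not batch:
                continue
            outputs = self.model_fn([p for p, _ in batch])  # one batched call
            for (_, out), y in zip(batch, outputs):
                out.put(y)  # deliver via the per-request output queue
```

A worker would call `server.submit(prompt)` once, keep simulating, and poll the returned queue on a later step, which is what keeps the training loop from stalling on model latency.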
To distill VLM knowledge into the policy, the paper proposes two supervision mechanisms:
- **Value-Margin Regularization (VMR)** – a KL-divergence regularizer that is applied only when the value of the VLM-suggested action exceeds the current policy's value by a predefined margin. This encourages the policy to imitate VLM actions selectively, preserving exploration while leveraging high-value expert advice.
- **Advantage-Weighted Action Guidance (AWAG)** – computes the advantage of the VLM action (Q(s, a_VLM) − V(s)) and scales the policy loss by an exponential weight exp(β·advantage). Actions with higher estimated advantage exert stronger influence on the policy update, effectively turning the VLM into a dynamic mentor that guides learning proportionally to its perceived usefulness.
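The two loss terms above can be sketched for a toy discrete-action case. This is an illustrative reading of the summary, not the paper's code: the function name, the one-hot-target form of the KL term, and the epsilon constant are all assumptions.

```python
import numpy as np

def vmr_awag_losses(policy_probs, vlm_action, q_values, v_value,
                    margin=0.1, beta=1.0):
    """Toy discrete-action sketch of the two supervision terms.
    policy_probs: current policy distribution over actions, shape [A].
    vlm_action:   index of the VLM-suggested action.
    q_values:     critic estimates Q(s, a) per action, shape [A].
    v_value:      critic estimate V(s).  All names are illustrative."""
    advantage = q_values[vlm_action] - v_value

    # VMR: KL(one-hot VLM target || policy) reduces to a cross-entropy on
    # the suggested action; applied only when that action's value beats
    # the policy's value estimate by at least `margin`.
    if advantage > margin:
        vmr_loss = -np.log(policy_probs[vlm_action] + 1e-8)
    else:
        vmr_loss = 0.0

    # AWAG: log-likelihood of the VLM action, scaled by exp(beta * advantage)
    # so higher-advantage suggestions pull the policy update harder.
    weight = np.exp(beta * advantage)
    awag_loss = -weight * np.log(policy_probs[vlm_action] + 1e-8)

    return vmr_loss, awag_loss
```

In practice both terms would be added (with their own coefficients) to the actor loss of the underlying learner; the margin gate is what makes VMR selective, while AWAG's exponential weight modulates rather than gates.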
For dense reward shaping, the authors design Conditional Contrastive Action Alignment for CLIP. Standard CLIP suffers from “dynamic blindness” – it ignores situational cues such as current speed or high‑level commands. To address this, the method discretizes speed and route command, embeds them into the text prompt, and evaluates similarity between the current observation and a small set of predefined “action anchors” (e.g., accelerate, decelerate, lane‑keep). After softmax normalization, a margin‑based bonus is computed from the top‑two anchor scores, yielding a normalized, context‑specific reward term r_clip. This term is added to the environment reward with a tunable weight λ, providing dense, semantically grounded feedback that accelerates learning in sparse‑reward settings.
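The reward-shaping recipe above can be sketched as two small functions: one that conditions the anchor prompts on discretized speed and command, and one that turns CLIP-style similarities into the margin bonus. The bin labels, anchor strings, prompt template, and temperature value are illustrative assumptions; the embeddings here stand in for actual CLIP image/text encoders.

```python
import numpy as np

SPEED_BINS = ["stopped", "slow", "cruising", "fast"]        # illustrative bins
ACTION_ANCHORS = ["accelerate", "decelerate", "keep lane"]  # illustrative anchors

def build_anchor_prompts(speed_mps, command):
    """Condition each action anchor on discretized speed and route command,
    so the contrastive scores become context-specific rather than
    'dynamically blind'."""
    idx = min(int(speed_mps // 5), len(SPEED_BINS) - 1)  # crude 5 m/s bins
    ctx = f"ego vehicle {SPEED_BINS[idx]}, command: {command}"
    return [f"{ctx}, correct action: {a}" for a in ACTION_ANCHORS]

def clip_margin_bonus(image_emb, anchor_embs, temperature=0.07):
    """Softmax-normalize cosine similarities between the observation and
    the anchor prompts, then return the gap between the top-two anchor
    scores: a bounded, context-specific reward term r_clip in [0, 1]."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    sims = txt @ img / temperature            # temperature-scaled cosine sims
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()                      # softmax normalization
    top2 = np.sort(probs)[-2:]
    return float(top2[1] - top2[0])           # margin-based bonus
```

The margin form means the bonus is large only when one anchor clearly dominates, which keeps the shaped reward small for ambiguous scenes; the result would then be added to the environment reward with the weight λ mentioned above.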
The platform is built on CARLA and offers three clearly separated modules: (i) simulation (standardized benchmarks, multi-modal observations), (ii) algorithms (actor-critic learners such as SAC, DrQ-v2, and PPO, coupled with the asynchronous inference pipeline), and (iii) applications (plug-and-play VLM guidance and CLIP shaping). Replay buffers store both standard transition tuples and VLM/CLIP feedback, enabling seamless integration with existing RL codebases.
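The extended replay-buffer record described above might look like the following. The field names and buffer class are hypothetical, added only to show how asynchronous VLM/CLIP feedback can ride alongside the standard transition tuple.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class Transition:
    """Standard RL tuple plus foundation-model feedback (names illustrative)."""
    obs: np.ndarray
    action: int
    reward: float
    next_obs: np.ndarray
    done: bool
    vlm_action: Optional[int] = None  # expert-like suggestion, if one arrived
    vlm_available: bool = False       # availability flag from the server
    clip_bonus: float = 0.0           # dense r_clip shaping term

class ReplayBuffer:
    """Minimal FIFO buffer with uniform sampling."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.storage), batch_size)
```

Storing the availability flag lets the learner apply VMR/AWAG only on transitions where a VLM suggestion actually arrived, which is what makes the asynchronous (possibly delayed) feedback compatible with off-policy updates.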
Empirical evaluation spans five diverse driving scenarios (urban intersections, highway lane changes, emergency braking, adverse weather, mixed traffic). A lightweight policy network (~2 M parameters) equipped with VLM guidance (BLIP‑2‑large, ~2 B parameters) and CLIP shaping (ViT‑B/32) achieves near‑VLM performance: success rates rise from ~68 % (baseline) to ~92 %, collision rates drop from 12 % to 3 %, and average speed improves by ~15 %. Notably, in emergency‑braking tests the success rate jumps from 85 % to 98 %. Sample efficiency improves dramatically; the VLM‑enhanced agents require roughly 45 % fewer environment steps to reach comparable performance. Crucially, the asynchronous framework maintains ≈500 FPS training speed on a single GPU, whereas naïve per‑step VLM calls fall below 30 FPS, confirming the practicality of the design.
Limitations are acknowledged: VLM/CLIP inherit biases from their pre‑training corpora, potentially causing failures under rare lighting or weather conditions; the asynchronous architecture adds system‑level complexity, demanding careful monitoring and debugging; and all experiments remain in simulation, leaving real‑world transfer and safety verification as future work. The authors suggest extending the framework to multimodal large language models for high‑level goal specification, online fine‑tuning for domain adaptation, and hardware‑aware optimizations for on‑vehicle deployment.
In summary, Found‑RL demonstrates that foundation models can be efficiently harnessed to overcome the fundamental drawbacks of RL in autonomous driving. By decoupling heavy VLM inference through asynchronous batching, and by introducing principled supervision (VMR, AWAG) and dense, context‑aware reward shaping (Conditional Contrastive Action Alignment), the platform enables a lightweight RL agent to achieve performance on par with billion‑parameter VLMs while operating at real‑time speeds. The open‑source release of code, data, and models positions Found‑RL as a valuable testbed for the research community to explore the convergence of foundation models and reinforcement learning toward safer, more interpretable autonomous vehicles.