AppleVLM: End-to-end Autonomous Driving with Advanced Perception and Planning-Enhanced Vision-Language Models
End-to-end autonomous driving has emerged as a promising paradigm that integrates perception, decision-making, and control within a unified learning framework. Recently, Vision-Language Models (VLMs) have gained significant attention for their potential to enhance the robustness and generalization of end-to-end driving models in diverse and unseen scenarios. However, existing VLM-based approaches still face challenges, including suboptimal lane perception, language understanding biases, and difficulties in handling corner cases. To address these issues, we propose AppleVLM, an advanced perception and planning-enhanced VLM model for robust end-to-end driving. AppleVLM introduces a novel vision encoder and a planning strategy encoder to improve perception and decision-making. First, the vision encoder fuses spatial-temporal information from multi-view images across multiple timesteps using a deformable transformer mechanism, enhancing robustness to camera variations and facilitating scalable deployment across different vehicle platforms. Second, unlike traditional VLM-based approaches, AppleVLM introduces a dedicated planning modality that encodes explicit Bird’s-Eye-View spatial information, mitigating language biases in navigation instructions. Finally, a VLM decoder fine-tuned with a hierarchical Chain-of-Thought strategy integrates vision, language, and planning features to output robust driving waypoints. We evaluate AppleVLM in closed-loop experiments on two CARLA benchmarks, achieving state-of-the-art driving performance. Furthermore, we deploy AppleVLM on an AGV platform and successfully showcase real-world end-to-end autonomous driving in complex outdoor environments.
💡 Research Summary
AppleVLM presents a comprehensive solution to the shortcomings of existing vision‑language models (VLMs) for end‑to‑end autonomous driving. The system first processes multi‑view RGB images and LiDAR point clouds through two RegNet64 backbones, normalizing images and voxelizing point clouds before fusing them with cross‑attention. A deformable transformer then performs spatial‑cross and temporal‑self attention across both view and time dimensions, yielding features that are robust to camera placement and resolution changes.
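The spatial-cross then temporal-self attention pattern can be sketched as below. This is a minimal numpy illustration, not the paper's implementation: dense scaled dot-product attention stands in for deformable attention, and all shapes, seeds, and the `bev_queries` name are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention (a dense stand-in for deformable attention)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

# Illustrative shapes: V camera views, T timesteps, N tokens per view, D channels.
V, T, N, D = 6, 4, 16, 32
rng = np.random.default_rng(0)
feats = rng.standard_normal((T, V, N, D))   # per-view image features
bev_queries = rng.standard_normal((N, D))   # learnable BEV query tokens (assumed)

# Spatial cross-attention: BEV queries gather from all views at each timestep.
per_t = np.stack([attend(bev_queries, f.reshape(V * N, D), f.reshape(V * N, D))
                  for f in feats])          # (T, N, D)

# Temporal self-attention: each BEV position fuses its history across timesteps.
fused = np.stack([attend(per_t[-1, n][None], per_t[:, n], per_t[:, n])[0]
                  for n in range(N)])       # (N, D) spatio-temporal BEV feature
print(fused.shape)
```

Because the BEV queries, not the raw pixel grid, define the output layout, this style of fusion is what makes the features tolerant to changes in camera placement and resolution.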
To mitigate the inherent ambiguity of natural‑language navigation commands, AppleVLM introduces a dedicated planning‑strategy encoder. This module consumes the BEV (bird’s‑eye‑view) features from the vision encoder and generates explicit planning‑template tokens that encode lane geometry, intersection zones, and maneuver corridors. These tokens provide precise spatial cues that complement the language input.
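One plausible reading of the planning-strategy encoder is a set of learned template queries that each pool the BEV cells relevant to one maneuver slot. The sketch below assumes this cross-attention pooling design; the token count, grid size, and slot semantics are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative shapes: H x W BEV grid with D channels, K planning-template tokens.
H, W, D, K = 50, 50, 32, 8
rng = np.random.default_rng(1)
bev = rng.standard_normal((H, W, D))             # BEV features from the vision encoder
template_queries = rng.standard_normal((K, D))   # assumed lane / intersection / corridor slots

# Each template query attends over all BEV cells and pools the relevant ones.
cells = bev.reshape(H * W, D)
weights = softmax(template_queries @ cells.T / np.sqrt(D))  # (K, H*W)
planning_tokens = weights @ cells                           # (K, D)
print(planning_tokens.shape)
```

The resulting tokens carry explicit spatial structure, so a vague instruction like "turn left" can be grounded against a concrete maneuver corridor rather than interpreted from language alone.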
Vision, language, and planning tokens are merged by a Q‑Former multimodal attention layer and fed into a large‑scale VLM decoder (e.g., LLaVA or Janus Pro). The decoder is fine‑tuned with a hierarchical Chain‑of‑Thought (CoT) approach on a real‑world corner‑case dataset, training it to reason through three stages: general perception, region perception, and driving suggestion. This CoT fine‑tuning dramatically improves out‑of‑distribution robustness, especially for rare events such as sudden pedestrian crossings.
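The three-stage reasoning hierarchy can be made concrete with a small prompt-assembly sketch. The stage names come from the summary above, but the prompt wording and the `build_cot_prompt` helper are hypothetical, since the exact fine-tuning templates are not given.

```python
# Hierarchical CoT stages, coarse to fine, as named in the summary.
STAGES = ["general perception", "region perception", "driving suggestion"]

def build_cot_prompt(instruction: str) -> list[str]:
    """Build one sub-prompt per reasoning stage, chaining each stage's
    output into the context of the next (assumed prompt wording)."""
    prompts = []
    context = instruction
    for stage in STAGES:
        prompts.append(f"[{stage}] Context: {context}. Reason about this stage.")
        context = f"the {stage} result"
    return prompts

prompts = build_cot_prompt("turn left at the next intersection")
for p in prompts:
    print(p)
```

Structuring supervision this way forces the decoder to commit to a scene description before proposing an action, which is why rare events (a sudden pedestrian crossing, for instance) are caught at the perception stages rather than surfacing as a bad waypoint.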
Training proceeds in four stages: (1) pre‑train the vision encoder on BEV prediction, (2) train the planning encoder using frozen vision features, (3) fine‑tune the VLM decoder with CoT on corner cases, and (4) end‑to‑end behavior‑cloning of waypoints while keeping the vision and planning encoders frozen. Losses combine waypoint regression and planning‑template alignment.
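The stage-4 objective above can be sketched as a weighted sum of the two loss terms. This is a minimal numpy sketch under stated assumptions: L1 regression for waypoints, a cosine alignment term standing in for the planning-template alignment loss, and an illustrative weight `lam`.

```python
import numpy as np

def waypoint_l1(pred, target):
    """L1 behavior-cloning loss on predicted waypoints of shape (T, 2)."""
    return np.abs(pred - target).mean()

def template_alignment(pred_tokens, target_tokens):
    """Cosine alignment between predicted and reference planning tokens
    (an assumed stand-in for the paper's alignment term)."""
    p = pred_tokens / np.linalg.norm(pred_tokens, axis=-1, keepdims=True)
    t = target_tokens / np.linalg.norm(target_tokens, axis=-1, keepdims=True)
    return 1.0 - (p * t).sum(-1).mean()

def total_loss(pred_wp, gt_wp, pred_tok, gt_tok, lam=0.5):
    # lam is an illustrative weighting, not a value from the paper.
    return waypoint_l1(pred_wp, gt_wp) + lam * template_alignment(pred_tok, gt_tok)

rng = np.random.default_rng(2)
wp, tok = rng.standard_normal((4, 2)), rng.standard_normal((8, 32))
print(total_loss(wp, wp, tok, tok))  # ~0 when predictions match targets
```

Because the vision and planning encoders stay frozen in stage 4, only the decoder receives gradients from this combined loss.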
Evaluation on CARLA’s Longest6 and NoCrash benchmarks shows AppleVLM surpasses prior state‑of‑the‑art VLM‑based driving models in success rate, collision avoidance, and lane‑keeping, particularly in multi‑lane intersections where language bias previously caused lane oscillations. Real‑world deployment on an autonomous ground vehicle (AGV) demonstrates stable closed‑loop operation in complex outdoor environments, with the generated waypoints driving an LQR controller for steering, throttle, and brake commands.
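To show how waypoints can drive an LQR controller, here is a minimal sketch of discrete-time LQR steering on a simplified lateral-error model. The two-state model, timestep, speed, and cost weights are all illustrative assumptions, not the deployed controller's parameters.

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=300):
    """Solve the discrete-time Riccati equation by fixed-point iteration
    and return the state-feedback gain K, with control u = -K x."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

# Assumed lateral model: x = [cross-track error, heading error], u = steering.
dt, v = 0.1, 5.0
A = np.array([[1.0, v * dt],
              [0.0, 1.0]])
B = np.array([[0.0],
              [dt]])
K = lqr_gain(A, B, Q=np.diag([1.0, 0.5]), R=np.array([[0.1]]))

# Track the next predicted waypoint by driving the error state toward zero.
x = np.array([1.0, 0.2])            # 1 m lateral offset, 0.2 rad heading error
for _ in range(200):
    u = float(-K @ x)               # steering command
    x = A @ x + B @ np.array([u])   # closed-loop state update
print(np.linalg.norm(x))            # residual tracking error
```

In a full stack, the same feedback law would be re-solved or gain-scheduled as speed changes, and a separate longitudinal loop would map waypoint spacing to throttle and brake.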
In summary, AppleVLM’s three‑pronged architecture—deformable spatio‑temporal vision fusion, explicit BEV planning tokens, and CoT‑enhanced VLM decoding—delivers a robust, scalable, and interpretable end‑to‑end autonomous driving system, paving the way for broader real‑world adoption. Future work will explore additional sensors (radar, ultrasonic) and dynamic learning of planning templates to further improve multimodal generalization.