FlexMap: Generalized HD Map Construction from Flexible Camera Configurations
High-definition (HD) maps provide essential semantic information about road structures for autonomous driving systems, yet current HD map construction methods require calibrated multi-camera setups and either implicit or explicit 2D-to-BEV transformations, making them fragile when sensors fail or camera configurations vary across vehicle fleets. We introduce FlexMap: unlike prior methods that are fixed to a specific N-camera rig, our approach adapts to variable camera configurations without any architectural changes or per-configuration retraining. Our key innovation eliminates explicit geometric projections by using a geometry-aware foundation model with cross-frame attention to implicitly encode 3D scene understanding in feature space. FlexMap features two core components: a spatial-temporal enhancement module that separates cross-view spatial reasoning from temporal dynamics, and a camera-aware decoder with latent camera tokens, enabling view-adaptive attention without the need for projection matrices. Experiments demonstrate that FlexMap outperforms existing methods across multiple configurations while maintaining robustness to missing views and sensor variations, enabling more practical real-world deployment.
💡 Research Summary
FlexMap addresses a critical gap in autonomous‑driving perception: the ability to generate high‑definition (HD) vectorized maps from arbitrary, uncalibrated camera setups. Traditional HD‑map construction pipelines such as MapTR, GeMap, or StreamMapNet rely on a fixed multi‑camera rig and an explicit 2D‑to‑BEV projection that requires precise intrinsics and extrinsics. This makes them fragile when cameras fail, are mis‑calibrated, or when fleet vehicles carry heterogeneous sensor suites.
The proposed system eliminates the need for any geometric projection matrices. It builds on a geometry‑aware foundation model (VGGT‑style) that processes each image as a set of visual tokens produced by a DINOv2 encoder. A learnable “camera token” is prepended to each view’s token sequence, allowing the model to implicitly learn view‑specific geometry without ever receiving calibration data. The backbone alternates between frame‑level self‑attention (capturing intra‑view spatial structure) and global cross‑view attention (capturing inter‑view and temporal relationships).
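The token layout and alternating attention pattern described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: all sizes (`NUM_VIEWS`, `TOKENS_PER_VIEW`, `D`) and the bare unprojected attention are assumptions made for readability.

```python
# Illustrative sketch: camera token prepended per view, then alternating
# frame-level self-attention and global cross-view attention.
# All shapes are hypothetical, not the paper's hyperparameters.
import numpy as np

rng = np.random.default_rng(0)
NUM_VIEWS, TOKENS_PER_VIEW, D = 3, 8, 16  # assumed toy sizes

def attention(x):
    """Single-head scaled dot-product self-attention (no learned projections)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

# One learnable camera token per view, prepended to that view's patch tokens
# (the patch tokens stand in for a DINOv2-style encoder output).
camera_tokens = rng.normal(size=(NUM_VIEWS, 1, D))
patch_tokens = rng.normal(size=(NUM_VIEWS, TOKENS_PER_VIEW, D))
views = np.concatenate([camera_tokens, patch_tokens], axis=1)  # (V, 1+P, D)

# Alternate: frame-level self-attention (each view alone), then
# global attention over all views' tokens flattened together.
for _ in range(2):
    views = np.stack([attention(v) for v in views])   # intra-view structure
    flat = attention(views.reshape(-1, D))            # cross-view relationships
    views = flat.reshape(NUM_VIEWS, 1 + TOKENS_PER_VIEW, D)

print(views.shape)  # (3, 9, 16)
```

The key point the sketch captures is that no intrinsics or extrinsics ever enter the computation; the camera token is just another vector in the sequence, so any number of views can be concatenated.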
FlexMap’s core novelty lies in its Spatial‑Temporal Enhancement module. Instead of mixing spatial and temporal cues in a single attention block, it explicitly separates them: (1) Cross‑view attention operates on frames that share the same timestamp, enabling the model to fuse overlapping fields of view and resolve spatial ambiguities; (2) Temporal attention operates on frames from the same camera across different timestamps, preserving motion continuity and filtering out transient objects. This separation lets the network treat static road markings and dynamic traffic participants with appropriate context.
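The factorization above amounts to two different groupings of the same token tensor. The sketch below, with assumed toy dimensions, shows cross-view attention mixing cameras at a fixed timestamp and temporal attention mixing timestamps for a fixed camera; it is a shape-level illustration, not the actual module.

```python
# Hedged sketch of the spatial/temporal separation: tokens indexed by
# (timestamp, view, token, feature); two attention passes group them
# differently. Sizes are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(1)
T, V, N, D = 4, 3, 6, 16  # timestamps, views, tokens per view, feature dim

def attention(x):
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

tokens = rng.normal(size=(T, V, N, D))

# (1) Cross-view attention: at each timestamp t, all views' tokens attend
#     jointly -> fuses overlapping fields of view, resolves spatial ambiguity.
spatial = np.stack([
    attention(tokens[t].reshape(V * N, D)).reshape(V, N, D)
    for t in range(T)
])

# (2) Temporal attention: for each camera v, tokens attend across
#     timestamps of that same camera -> preserves motion continuity.
temporal = np.stack([
    attention(spatial[:, v].reshape(T * N, D)).reshape(T, N, D)
    for v in range(V)
], axis=1)

print(temporal.shape)  # (4, 3, 6, 16)
```

Because the two passes never mix axes, static road markings get temporal smoothing within one camera's stream, while spatial fusion happens only among simultaneous frames.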
The decoder is “camera‑aware”. It receives hierarchical map queries together with the latent camera tokens. These tokens condition the attention mechanism so that each query can adapt its focus based on which cameras are present, their typical orientation, and the density of observable map elements. Consequently, the decoder can produce BEV‑consistent polylines even when some cameras are missing, because the remaining tokens supply enough geometric grounding.
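One way to picture the camera-aware decoding is as cross-attention in which the key/value set is assembled from only the views that are present, with each present view contributing its latent camera token alongside its image features. The sketch below is a simplified assumption about the mechanism (names, sizes, and the plain cross-attention are all hypothetical).

```python
# Hedged sketch: map queries cross-attend over image tokens plus the latent
# camera tokens of the views that are actually present; absent views are
# simply dropped from the key/value set. All names/sizes are assumptions.
import numpy as np

rng = np.random.default_rng(2)
Q, V, N, D = 5, 6, 8, 16  # map queries, cameras, tokens per view, feature dim

def cross_attention(q, kv):
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

queries = rng.normal(size=(Q, D))          # hierarchical map queries
camera_tokens = rng.normal(size=(V, D))    # latent per-camera tokens
image_tokens = rng.normal(size=(V, N, D))  # per-view image features

present = [0, 1, 3]  # e.g. three of six cameras remain after sensor loss
kv = np.concatenate([
    camera_tokens[present],                 # geometric grounding per view
    image_tokens[present].reshape(-1, D),   # appearance features
])

decoded = cross_attention(queries, kv)      # (Q, D) view-adaptive features
print(decoded.shape)  # (5, 16)
```

Nothing in the computation depends on how many entries `present` holds, which is the property that lets the same decoder serve 6PV, 1PV, or arbitrary subsets.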
Experiments on the large‑scale nuScenes and Argoverse datasets evaluate a wide range of configurations: the full six‑camera rig (6PV), single‑front‑camera (1PV), and random subsets that simulate sensor loss. FlexMap consistently outperforms baseline methods in mean Average Precision (mAP) and Intersection‑over‑Union (IoU) by 5–8 percentage points across all settings. When synthetic calibration noise (±5°/±10°) is injected, performance degrades only marginally, confirming robustness to mis‑calibration. In ablation studies, removing the camera tokens or merging spatial and temporal attention leads to noticeable drops, highlighting the importance of both design choices. Qualitative visualizations show that FlexMap avoids the boundary fragmentation observed in BEV‑based decoders, delivering smoother, more complete lane and divider polylines.
Limitations include reliance on a backbone trained primarily for static scene reconstruction, which may struggle under extreme lighting changes or highly dynamic environments. The latent camera tokens are learned from the training distribution; completely novel camera placements (e.g., roof‑mounted upward‑facing lenses) could require fine‑tuning. Future work is suggested in three directions: (i) integrating explicit dynamic‑object segmentation to better separate moving agents from permanent road geometry; (ii) employing meta‑learning or prompt‑style adaptation for rapid token re‑initialization on unseen camera layouts; and (iii) extending the framework to fuse low‑cost lidar or radar data for further resilience. Overall, FlexMap represents a significant step toward scalable, cost‑effective HD‑map generation that can be deployed on heterogeneous vehicle fleets without the burden of precise sensor calibration.