Occlusion-Aware Multimodal Beam Prediction and Pose Estimation for mmWave V2I
We propose an occlusion-aware multimodal learning framework that is inspired by simultaneous localization and mapping (SLAM) concepts for trajectory interpretation and pose prediction. Targeting mmWave vehicle-to-infrastructure (V2I) beam management under dynamic blockage, our Transformer-based fusion network ingests synchronized RGB images, LiDAR point clouds, radar range-angle maps, GNSS, and short-term mmWave power history. It jointly predicts the receive beam index, blockage probability, and 2D position using labels automatically derived from 64-beam sweep power vectors, while an offline LiDAR map enables SLAM-style trajectory visualization. On the 60 GHz DeepSense 6G Scenario 31 dataset, the model achieves 50.92% Top-1 and 86.50% Top-3 beam accuracy with 0.018 bits/s/Hz spectral-efficiency loss, 63.35% blocked-class F1, and 1.33 m position RMSE. Multimodal fusion outperforms radio-only and strong camera-only baselines, showing the value of coupling perception and communication for future 6G V2I systems.
💡 Research Summary
The paper tackles the challenging problem of beam management for millimeter‑wave (mmWave) vehicle‑to‑infrastructure (V2I) links in dense urban environments where dynamic blockages frequently disrupt line‑of‑sight communication. Traditional approaches rely solely on radio feedback (e.g., angle‑of‑arrival, virtual anchors) or on single‑sensor SLAM pipelines, which are either brittle under fast blockage or ignore the communication objective entirely. To bridge this gap, the authors propose an occlusion‑aware multimodal learning framework inspired by SLAM concepts, where perception and communication are fused into a single Transformer‑based network.
System Overview
At each synchronized time step the system collects five modalities: (1) an RGB image, (2) a LiDAR point cloud, (3) an FMCW radar range‑angle magnitude map, (4) a GNSS‑derived 2‑D position, and (5) the previous time‑step’s 64‑beam received‑power vector (r_{t‑1}). Each modality is passed through a lightweight, trainable encoder (ResNet‑18 for images, PointNet‑style for LiDAR, a small CNN for radar, a two‑layer MLP for GNSS, and another MLP for the radio history), producing d‑dimensional tokens. The five tokens plus a learnable CLS token are stacked, enriched with modality‑type embeddings, and fed into a standard Vision‑Transformer encoder. The CLS output h_{CLS} acts as a compact multimodal state that simultaneously encodes geometric layout, occlusion cues, and short‑term radio context.
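The token-fusion architecture described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the per-modality encoders are replaced by placeholder linear projections over precomputed feature vectors, and the embedding width `d`, layer count, and head count are assumptions.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Sketch of the CLS-token fusion Transformer (illustrative only)."""
    def __init__(self, d=128, n_layers=4, n_heads=4, n_beams=64):
        super().__init__()
        # Placeholder encoders: (assumed feature_dim -> d) linear projections
        self.enc = nn.ModuleDict({
            "rgb":   nn.Linear(512, d),   # stands in for ResNet-18 features
            "lidar": nn.Linear(256, d),   # stands in for PointNet-style features
            "radar": nn.Linear(256, d),   # stands in for the small radar CNN
            "gnss":  nn.Linear(2, d),     # 2-D GNSS position
            "radio": nn.Linear(64, d),    # previous 64-beam power vector r_{t-1}
        })
        self.cls = nn.Parameter(torch.zeros(1, 1, d))          # learnable CLS token
        self.mod_emb = nn.Parameter(torch.zeros(1, 6, d))      # modality-type embeddings
        layer = nn.TransformerEncoderLayer(d, n_heads, dim_feedforward=4 * d,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        # Three linear heads decode h_CLS
        self.beam_head = nn.Linear(d, n_beams)  # beam logits (64 classes)
        self.blk_head = nn.Linear(d, 1)         # blockage logit
        self.pose_head = nn.Linear(d, 2)        # 2-D pose estimate

    def forward(self, x):
        B = x["rgb"].shape[0]
        tokens = [self.enc[k](x[k]).unsqueeze(1)
                  for k in ("rgb", "lidar", "radar", "gnss", "radio")]
        # Stack CLS + 5 modality tokens, add modality-type embeddings
        seq = torch.cat([self.cls.expand(B, -1, -1)] + tokens, dim=1) + self.mod_emb
        h_cls = self.transformer(seq)[:, 0]     # compact multimodal state
        return self.beam_head(h_cls), self.blk_head(h_cls), self.pose_head(h_cls)
```

With a batch of feature tensors, the forward pass returns (B, 64) beam logits, a (B, 1) blockage logit, and a (B, 2) pose estimate.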
Three linear heads decode h_{CLS} into (i) beam logits (64 classes), (ii) a blockage logit, and (iii) a 2‑D pose estimate. Softmax and sigmoid produce the beam probability distribution π_t and blockage probability q_t. The model is trained with a multi‑task loss L = λ_beam·L_beam + λ_blk·L_blk + λ_pose·L_pose, where L_beam is cross‑entropy against the optimal beam (the beam with maximal received power in the full sweep), L_blk is binary cross‑entropy against a blockage label derived from a power threshold (samples whose maximum received power falls in the lower 20 % of observed values), and L_pose is mean‑squared error against the ground‑truth GNSS position. The λ weights are tuned so that each term contributes comparably while giving priority to beam alignment.
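The label derivation and multi-task objective can be written compactly. This is a hedged sketch: the λ values and the batch-wise interpretation of the 20 % power threshold are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def multitask_loss(beam_logits, blk_logit, pose_pred, power_sweep, gnss_xy,
                   lam=(1.0, 0.5, 0.5), blk_quantile=0.2):
    """Sketch of L = λ_beam·L_beam + λ_blk·L_blk + λ_pose·L_pose.

    Labels are derived from the full 64-beam sweep `power_sweep`;
    lambda weights and the threshold implementation are assumptions.
    """
    # Beam label: index of the strongest beam in the full sweep
    beam_label = power_sweep.argmax(dim=1)
    # Blockage label: max received power in the lower 20% of the observed range
    max_power = power_sweep.max(dim=1).values
    thresh = max_power.min() + blk_quantile * (max_power.max() - max_power.min())
    blk_label = (max_power < thresh).float()
    L_beam = F.cross_entropy(beam_logits, beam_label)
    L_blk = F.binary_cross_entropy_with_logits(blk_logit.squeeze(1), blk_label)
    L_pose = F.mse_loss(pose_pred, gnss_xy)
    return lam[0] * L_beam + lam[1] * L_blk + lam[2] * L_pose
```

Deriving both classification labels directly from the sweep power vector is what makes the supervision "automatic": no manual annotation of beams or blockage events is needed.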
Dataset and Experimental Setup
The authors evaluate on the DeepSense 6G Scenario 31 dataset, which provides synchronized mmWave sweeps, RGB, LiDAR, radar, and GNSS for a vehicle traversing an urban street. The dataset contains 7,012 snapshots, split 70 %/15 %/15 % into training, validation, and test sets, with temporal ordering preserved to keep r_{t‑1} truly preceding r_t. Because blocked samples are scarce, the blockage loss is up‑weighted to improve the blocked‑class F1 score.
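The chronological split matters here: shuffling would let the radio-history feature r_{t−1} leak information across partitions. A minimal sketch of an order-preserving 70/15/15 split (the fractions follow the paper; the rounding behaviour is an assumption):

```python
def temporal_split(n, train=0.70, val=0.15):
    """Chronological split: indices keep temporal order so that r_{t-1}
    in the validation/test sets never comes from a later partition."""
    n_train = int(n * train)
    n_val = int(n * val)
    idx = list(range(n))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 7,012 snapshots as in DeepSense 6G Scenario 31
train_idx, val_idx, test_idx = temporal_split(7012)
```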
Training uses AdamW with weight decay, gradient clipping, and a ReduceLROnPlateau scheduler. The model converges within ~10 epochs, after which validation loss plateaus while training loss continues to drop, indicating limited over‑fitting.
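The optimization recipe (AdamW with weight decay, gradient-norm clipping, ReduceLROnPlateau) can be skeletonized as below. The model, data, and hyperparameter values are placeholders, not the paper's settings:

```python
import torch
import torch.nn as nn

# Minimal training-loop skeleton; a tiny linear model stands in for the
# multimodal Transformer, and all hyperparameters are illustrative.
model = nn.Linear(10, 64)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="min",
                                                   factor=0.5, patience=2)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(32, 10)
y = torch.randint(0, 64, (32,))

for epoch in range(5):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Gradient clipping stabilizes updates under noisy multimodal batches
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    # In practice the scheduler steps on the *validation* loss each epoch
    sched.step(loss.item())
```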
Results
- Beam Prediction: Top‑1 accuracy 50.92 % and Top‑3 accuracy 86.50 % on the test set, marginally better than the strongest unimodal baseline (camera‑only: 50.79 % / 86.03 %).
- Spectral‑Efficiency Loss: Average ΔSE = 0.018 bits/s/Hz, a slight improvement over camera‑only (0.019 bits/s/Hz).
- Blockage Detection: Blocked‑class F1 score 63.35 %, outperforming camera‑only (59.04 %) and all other single‑sensor models.
- Pose Estimation: 2‑D RMSE 1.33 m, a 37 % reduction compared to camera‑only (2.10 m).
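For reproducibility, the two communication metrics above can be computed from beam logits and the ground-truth sweep. A sketch, with the caveat that treating raw received power as an SNR proxy inside log2(1 + ·) is an assumption about how the paper defines ΔSE:

```python
import numpy as np

def topk_accuracy(logits, labels, k):
    """Fraction of samples whose true beam is among the k highest logits."""
    topk = np.argsort(logits, axis=1)[:, -k:]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

def se_loss(power_sweep, pred_beam):
    """Spectral-efficiency loss sketch: log2(1 + P) of the best beam minus
    that of the predicted beam, averaged over samples (power-as-SNR proxy
    is an assumption)."""
    best = power_sweep.max(axis=1)
    pred = power_sweep[np.arange(len(pred_beam)), pred_beam]
    return float(np.mean(np.log2(1 + best) - np.log2(1 + pred)))
```

ΔSE is zero whenever the predicted beam matches the optimal one, and small even for near-misses to adjacent beams, which is why a ~51 % Top-1 accuracy can still yield only 0.018 bits/s/Hz of loss.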
The multimodal model matches vision‑only performance for beam selection (indicating that visual geometry dominates beam choice in this scenario) but gains substantially in blockage awareness and localization thanks to the complementary cues from LiDAR (depth), radar (range‑angle), GNSS (coarse absolute position), and short‑term radio history (direct blockage signal).
A qualitative SLAM‑style visualization overlays the predicted trajectory on an offline LiDAR map, showing that the learned representation respects the underlying street geometry and exhibits limited drift.
Discussion and Limitations
The work demonstrates that a shared latent state can simultaneously support communication‑centric tasks (beam selection, blockage prediction) and navigation‑centric tasks (pose regression). However, the current implementation only predicts planar (2‑D) positions, does not explicitly model dynamic objects, and lacks a detailed analysis of real‑time inference cost. Future directions include extending to full 3‑D pose, integrating dynamic object tracking, designing lightweight Transformer variants for edge deployment, and coupling the model with an online SLAM back‑end for closed‑loop V2I operation.
Conclusion
By fusing camera, LiDAR, radar, GNSS, and short‑term mmWave power history within a Transformer architecture, the authors achieve state‑of‑the‑art beam prediction, robust blockage detection, and accurate vehicle localization on a realistic 60 GHz V2I dataset. The results validate the premise that perception and communication should be co‑designed for future 6G ISAC systems, especially in environments where occlusion and rapid dynamics are the norm.