OMCL: Open-vocabulary Monte Carlo Localization
Robust robot localization is an important prerequisite for navigation planning. If the environment map was created from different sensors, robot measurements must be robustly associated with map features. In this work, we extend Monte Carlo Localization with vision-language features. These open-vocabulary features enable robust computation of the likelihood of visual observations, given a camera pose and a 3D map created from posed RGB-D images or aligned point clouds. The abstract vision-language features make it possible to associate observations and map elements across different modalities. Global localization can be initialized from natural-language descriptions of the objects in the vicinity of a location. We evaluate our approach on Matterport3D and Replica for indoor scenes and demonstrate generalization to outdoor scenes on SemanticKITTI.
💡 Research Summary
The paper introduces OMCL (Open‑vocabulary Monte Carlo Localization), a novel framework that integrates vision‑language models (VLMs) such as CLIP into the classic Monte Carlo Localization pipeline. The core idea is to store high‑dimensional, modality‑agnostic visual‑language embeddings in a spatial octree structure—called the Octree Language Map—where each voxel holds a feature vector representing the semantic content of that region.
Two mapping pipelines are provided. In the first, RGB‑D frames with known poses are processed by LSeg to obtain pixel‑wise embeddings; depth information projects these embeddings into the octree, averaging or replacing existing vectors based on a cosine‑distance threshold τ. In the second pipeline, pre‑computed point clouds are fed to OpenScene, which directly predicts embeddings for each point; these are then aggregated into the octree. This dual approach enables the creation of a unified map from heterogeneous sensors (RGB‑D cameras, LiDAR, or any point‑cloud source) while preserving a compact representation.
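The voxel-update rule described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: a flat dictionary stands in for the Octree Language Map, and the function name `insert_feature`, the voxel size, and the threshold value are our assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def insert_feature(voxel_map, point, feature, voxel_size=0.05, tau=0.8):
    """Insert one pixel/point embedding into a voxelized language map.

    If the target voxel already holds a feature and the new one is
    similar (cosine similarity >= tau), the two are running-averaged;
    otherwise the stored feature is replaced. A flat dict stands in for
    the paper's octree; `tau` and `voxel_size` are hypothetical values.
    """
    key = tuple(np.floor(np.asarray(point) / voxel_size).astype(int))
    if key not in voxel_map:
        voxel_map[key] = (feature.copy(), 1)  # (feature, observation count)
        return
    stored, count = voxel_map[key]
    if cosine_sim(stored, feature) >= tau:
        # similar semantic content: running average keeps the voxel stable
        merged = (stored * count + feature) / (count + 1)
        voxel_map[key] = (merged, count + 1)
    else:
        # dissimilar content: replace with the newer observation
        voxel_map[key] = (feature.copy(), 1)
```

The same update applies whether the embeddings come from LSeg pixels projected by depth or from OpenScene point predictions, which is what lets heterogeneous sensors feed one unified map.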
For localization, a particle filter samples candidate camera poses. For each particle, rays are cast from the hypothesized pose using camera intrinsics; the first intersected voxel’s embedding γ is retrieved. The current RGB image is processed by the same VLM to extract per‑pixel embeddings φ. The particle weight is updated by the average cosine similarity between φ and γ across all sampled rays, effectively measuring how well the observed visual‑language features align with the map. The final pose estimate is the weighted mean of the most probable particles.
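The observation model for a single particle can be sketched as below, assuming ray casting has already produced the map embeddings γ hit by each sampled ray. The function name and zero-vector convention for missed rays are our assumptions, not the paper's code.

```python
import numpy as np

def particle_weight(map_features, obs_features):
    """Score one particle by the mean cosine similarity between observed
    per-pixel embeddings (phi) and the map embeddings (gamma) returned by
    ray casting from the hypothesized pose.

    Both inputs are (N, D) arrays, one row per sampled ray. Rays that hit
    no voxel can be passed as zero rows and are skipped. A sketch of the
    paper's observation model, not its exact implementation.
    """
    m = np.asarray(map_features, dtype=float)
    o = np.asarray(obs_features, dtype=float)
    mn = np.linalg.norm(m, axis=1)
    on = np.linalg.norm(o, axis=1)
    valid = (mn > 0) & (on > 0)
    if not valid.any():
        return 0.0
    sims = np.einsum("ij,ij->i", m[valid], o[valid]) / (mn[valid] * on[valid])
    return float(sims.mean())
```

Across the particle set these scores would then be normalized into weights, and the pose estimate taken as the weighted mean of the most probable particles.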
A key contribution is stratified ray sampling. Uniform sampling would over‑represent large, texture‑less surfaces (walls, floors) and under‑represent small but distinctive objects. The authors cluster image pixels based on similarity to the map’s feature database (Features DB) and then uniformly sample an equal number of pixels from each cluster, discarding duplicates. This per‑cluster sampling preserves information from all semantic elements and reduces computational load.
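The per-cluster sampling idea can be sketched as follows, assuming pixels have already been assigned cluster labels by their nearest entry in the feature database. The function name and parameters are illustrative, not from the paper.

```python
import numpy as np

def stratified_sample(labels, n_per_cluster, rng=None):
    """Sample up to n_per_cluster pixel indices from each semantic cluster.

    `labels` assigns every pixel a cluster id (e.g. the nearest Features DB
    entry). Sampling a fixed budget per cluster, without replacement, keeps
    small distinctive objects represented instead of letting large uniform
    surfaces (walls, floors) dominate. A sketch under our own naming.
    """
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    picked = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        k = min(n_per_cluster, idx.size)          # small clusters yield all pixels
        picked.append(rng.choice(idx, size=k, replace=False))
    return np.concatenate(picked)
```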
Another innovative component is prompt‑augmented global initialization. Instead of requiring geometric coordinates or random particle distribution, users can provide natural‑language prompts describing the surroundings (e.g., “toilet, mirror, towel”). The text encoder converts each word into an embedding; the system then finds floor voxels whose neighboring voxels have high similarity to these embeddings. Particles are initialized around these matched locations, allowing non‑expert users to launch global localization with simple language. This also enables integration with large language models for autonomous agent‑driven initialization.
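The matching step behind this initialization can be sketched as below. All names, the similarity threshold, and the Gaussian scattering of particles around matches are our assumptions; the paper's text encoder is replaced by precomputed prompt embeddings.

```python
import numpy as np

def init_particles_from_prompts(floor_positions, neighbor_features,
                                prompt_embeddings, sim_thresh=0.7,
                                particles_per_match=10, noise=0.3, rng=None):
    """Seed particles near floor voxels whose neighborhood matches a prompt.

    `floor_positions` is a list of 2D floor-voxel positions and
    `neighbor_features[i]` an (M, D) array of embeddings of the voxels
    around floor voxel i. A floor voxel matches if any neighbor is
    similar to any text-prompt embedding; particles are then scattered
    around matched positions. Thresholds and scatter scale are hypothetical.
    """
    rng = np.random.default_rng(rng)
    p = prompt_embeddings / (np.linalg.norm(prompt_embeddings, axis=1,
                                            keepdims=True) + 1e-12)
    particles = []
    for pos, feats in zip(floor_positions, neighbor_features):
        f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
        if (f @ p.T).max() >= sim_thresh:
            # scatter a small particle cloud around the matched location
            particles.append(pos + rng.normal(scale=noise,
                                              size=(particles_per_match, 2)))
    return np.concatenate(particles) if particles else np.empty((0, 2))
```

Because the matching operates purely on embeddings, the prompts could equally well be produced by a large language model rather than typed by a user.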
The authors evaluate OMCL on indoor datasets (Matterport3D, Replica) and the outdoor SemanticKITTI benchmark. They compare against state‑of‑the‑art semantic localization methods (SeD‑AR, SIMP) and LiDAR‑to‑camera map matching approaches (CMRNext). OMCL consistently outperforms baselines, especially in cross‑modal scenarios where the map and observations come from different sensors. Ablation studies demonstrate the benefits of (1) using the full feature database versus prompt‑filtered subsets, (2) stratified sampling versus uniform sampling, and (3) language‑based initialization versus random seeding. Each component contributes measurable gains in positional error, orientation error, and convergence speed.
Limitations include the need for accurate pose information during the mapping phase and the current separation of mapping and localization (no online SLAM integration). Very generic prompts can also lead to ambiguous initialization. Future work aims to fuse OMCL with real‑time SLAM, handle dynamic objects, and learn adaptive weighting schemes for multi‑modal consistency.
In summary, OMCL advances robot localization by (i) representing maps with open‑vocabulary visual‑language embeddings, (ii) leveraging ray‑traced feature consistency for particle weighting, (iii) enabling natural‑language driven global initialization, and (iv) employing stratified sampling for efficient and balanced observation processing. This framework bridges sensor heterogeneity, improves robustness, and opens the door to more intuitive human‑robot interaction through language.