LIEREx: Language-Image Embeddings for Robotic Exploration
Semantic maps allow a robot to reason about its surroundings to fulfill tasks such as navigating known environments, finding specific objects, and exploring unmapped areas. Traditional mapping approaches provide accurate geometric representations but are often constrained by pre-designed symbolic vocabularies. The reliance on fixed object classes makes it impractical to handle out-of-distribution knowledge not defined at design time. Recent advances in Vision-Language Foundation Models, such as CLIP, enable open-set mapping, where objects are encoded as high-dimensional embeddings rather than fixed labels. In LIEREx, we integrate these VLFMs with established 3D Semantic Scene Graphs to enable target-directed exploration by an autonomous agent in partially unknown environments.
💡 Research Summary
LIEREx (Language‑Image Embeddings for Robotic Exploration) presents a novel framework that fuses Vision‑Language Foundation Models (VLFMs) such as CLIP with 3D Semantic Scene Graphs (3DSSG) to enable open‑set, language‑driven exploration in partially unknown indoor environments. Traditional semantic mapping pipelines rely on a fixed set of object classes, limiting their ability to handle out‑of‑distribution concepts that were not predefined at design time. By encoding visual observations as high‑dimensional CLIP embeddings rather than discrete labels, LIEREx allows a robot to answer arbitrary natural‑language queries (“find a chair in the kitchenette”, “locate a red mug”) and retrieve matching nodes from a hierarchical graph that captures both spatial relationships and semantic context.
The system operates in four stages. First, incoming RGB‑D frames are processed by a class‑agnostic segmentation model (e.g., Mask2Former) to generate object masks without relying on predefined categories. Second, each mask is fed through CLIP’s image encoder to obtain a feature vector, which is attached to the corresponding node in the 3DSSG. The graph itself is a heterogeneous, multi‑layer structure that represents low‑level entities (individual objects) up to high‑level concepts (rooms, zones) and is continuously updated as new observations arrive. Third, a user‑provided textual query is encoded with CLIP’s text encoder; cosine similarity between the query vector and all node embeddings yields a ranked list of candidate objects. Fourth, instead of exhaustive geometric evaluation (e.g., TSDF ray‑casting) to decide where the robot should look, LIEREx introduces a View Quality Estimation (VQE) module. VQE is a neural network trained on large‑scale simulated data (Habitat + HM3D) that predicts a quality score for any candidate observation pose given a query. Training is self‑supervised: rendered views of candidate poses are encoded with CLIP, compared to the query embedding, and a cosine loss drives the network to assign higher scores to views that make the queried concept visually distinctive.
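The retrieval step (stage three) can be illustrated with a minimal sketch: a query embedding is compared against all node embeddings by cosine similarity, and the best-scoring nodes are returned. This is not code from the paper — the function names, the toy 8‑dimensional vectors, and the node ids are illustrative stand-ins (real CLIP embeddings are 512‑ or 768‑dimensional, produced by the image and text encoders).

```python
import numpy as np

def cosine_similarity(query, embeddings):
    """Cosine similarity between a query vector and each row of a matrix."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return e @ q

def rank_nodes(query_embedding, node_embeddings, node_ids, top_k=3):
    """Return graph-node ids ranked by similarity to the encoded text query."""
    scores = cosine_similarity(query_embedding, node_embeddings)
    order = np.argsort(-scores)[:top_k]  # descending by score
    return [(node_ids[i], float(scores[i])) for i in order]

# Toy stand-ins for CLIP embeddings attached to 3DSSG nodes.
rng = np.random.default_rng(0)
node_ids = ["chair_1", "mug_3", "table_2"]
node_embeddings = rng.normal(size=(3, 8))

# A query whose embedding happens to lie close to mug_3's embedding.
query = node_embeddings[1] + 0.05 * rng.normal(size=8)

ranked = rank_nodes(query, node_embeddings, node_ids)
```

Because cosine similarity is computed against every node embedding, the same routine works regardless of what vocabulary the nodes were observed under — the open-set property the summary describes.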
During exploration, the planner combines a conventional frontier‑based strategy with the VQE scores. When a concrete target exists, the planner prioritizes high‑quality poses suggested by VQE, effectively “looking” at the most informative viewpoints. When no specific target is known, weighted frontier exploration continues to map unknown space. This hybrid approach reduces the computational burden of online ray‑casting while still respecting both semantic relevance and geometric feasibility.
Experiments were conducted in two domains. In simulation, thousands of query‑map pairs were generated from the HM3D dataset; VQE‑guided pose selection achieved up to 20 % higher success rates and reduced the average number of steps to locate a target by 1.8× compared to baseline geometric heuristics. Real‑world validation used a TIAGo 2 robot equipped with an Ouster OS0 LiDAR and a Femto‑Bolt ToF RGB‑D camera. The robot operated in a pre‑mapped office building, executing queries such as “kitchenette” and “chair”. Using VQE‑derived observation poses, the robot successfully identified the requested objects while decreasing total travel distance and exploration time by roughly 22 % and 18 %, respectively, relative to a purely frontier‑based planner.
The paper also discusses limitations. CLIP provides only global image embeddings, which can struggle with small or heavily occluded objects; integrating region‑level VLFM variants (e.g., RegionCLIP, SAM) could improve granularity. VQE’s performance depends on the diversity of simulated training data, so domain shift to environments with different lighting, textures, or layout may degrade accuracy, suggesting future work on domain‑adaptive or online self‑supervised learning. Finally, continuously updating high‑dimensional embeddings in a 3DSSG adds computational overhead to SLAM pipelines; lightweight graph update schemes or asynchronous processing are potential remedies.
Overall, LIEREx demonstrates that coupling vision‑language embeddings with structured 3D semantic representations enables robots to interpret and act upon open‑set natural‑language goals, moving beyond rigid label‑based maps toward more flexible, human‑centric navigation and exploration capabilities.