Safe mobility support system using crowd mapping and avoidance route planning using VLM
Autonomous mobile robots offer promising solutions for labor shortages and increased operational efficiency. However, navigating safely and effectively in dynamic environments, particularly crowded areas, remains challenging. This paper proposes a novel framework that integrates Vision-Language Models (VLMs) and Gaussian Process Regression (GPR) to generate dynamic crowd-density maps ("Abstraction Maps") for autonomous robot navigation. Our approach utilizes a VLM's capability to recognize abstract environmental concepts, such as crowd density, and represents them probabilistically via GPR. Experimental results from real-world trials on a university campus demonstrated that robots successfully generated routes avoiding both static obstacles and dynamic crowds, enhancing navigation safety and adaptability.
💡 Research Summary
The paper addresses a critical gap in autonomous mobile robot navigation: the ability to perceive and avoid crowds, an abstract and dynamic environmental factor that traditional perception pipelines struggle to represent. While prior work has focused on static obstacle avoidance or on modeling individual agents with sophisticated probabilistic or reinforcement‑learning methods, few approaches have treated “crowd” as a high‑level semantic concept that can be directly incorporated into a cost map for path planning.
To solve this, the authors propose a framework that couples a Vision‑Language Model (VLM) with Gaussian Process Regression (GPR) to generate what they call an “Abstraction Map”. The pipeline works as follows: a forward‑facing RGB camera on the robot captures an image; a textual prompt (e.g., “Is there a crowd in front of the robot?”) is concatenated with the image and fed to a VLM (specifically, GPT‑4o‑mini, which accepts multimodal inputs). The VLM returns a binary label—“crowd” or “free”. This labeling is performed every three meters along the robot’s trajectory, producing a sparse set of crowd‑presence observations.
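The sampling logic described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the real system sends the image and prompt to GPT‑4o‑mini, whereas here `mock_vlm_crowd_label` is a hypothetical stand-in so the distance-triggered query loop can be shown end to end; the 3 m interval comes from the paper.

```python
import math

QUERY_INTERVAL_M = 3.0  # the paper queries the VLM every three meters of travel


def should_query(last_query_pos, current_pos, interval=QUERY_INTERVAL_M):
    """Trigger a new VLM query once the robot has moved `interval` meters."""
    dx = current_pos[0] - last_query_pos[0]
    dy = current_pos[1] - last_query_pos[1]
    return math.hypot(dx, dy) >= interval


def mock_vlm_crowd_label(image):
    """Hypothetical stand-in for the multimodal VLM call.

    Returns 1 for "crowd", 0 for "free" (the paper's binary label).
    Here `image` is a dict with a fake 'people' count instead of pixels.
    """
    return 1 if image.get("people", 0) >= 3 else 0


def collect_observations(trajectory, images):
    """Walk the trajectory, issuing a (mock) VLM query every QUERY_INTERVAL_M meters.

    Returns sparse (position, label) pairs, which downstream GPR interpolates.
    """
    observations = [(trajectory[0], mock_vlm_crowd_label(images[0]))]
    last = trajectory[0]
    for pos, img in zip(trajectory[1:], images[1:]):
        if should_query(last, pos):
            observations.append((pos, mock_vlm_crowd_label(img)))
            last = pos
    return observations
```

The output is intentionally sparse: only one labeled point roughly every three meters, which is why the next stage needs a spatial interpolator.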
Because these observations are spatially sparse, the authors employ Gaussian Process Regression to interpolate a continuous probability density over the entire navigation grid. The mean of the GP provides a smooth crowd‑cost surface, while the variance supplies a confidence measure that can be used for risk‑aware decision making. The GP uses an RBF kernel, reflecting the assumption that crowd density varies smoothly in space and time.
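A minimal NumPy sketch of this interpolation step, assuming a standard zero-mean GP posterior with an RBF kernel (the length scale and noise values are illustrative, not taken from the paper):

```python
import numpy as np


def rbf_kernel(A, B, length_scale=2.0):
    """RBF (squared-exponential) kernel between two sets of 2-D points."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)


def gp_predict(X_obs, y_obs, X_query, noise=1e-3, length_scale=2.0):
    """GP posterior mean and variance at query points.

    X_obs:   (n, 2) observation positions from the sparse VLM labels
    y_obs:   (n,)   binary crowd labels (1 = crowd, 0 = free)
    X_query: (m, 2) grid cells where the crowd cost is needed
    Returns the smooth mean surface (crowd cost) and per-point variance
    (confidence measure for risk-aware planning).
    """
    K = rbf_kernel(X_obs, X_obs, length_scale) + noise * np.eye(len(X_obs))
    K_s = rbf_kernel(X_query, X_obs, length_scale)
    K_ss = rbf_kernel(X_query, X_query, length_scale)
    alpha = np.linalg.solve(K, y_obs)
    mean = K_s @ alpha
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.diag(cov)
```

Near an observation the variance collapses toward the noise level, while far from all observations the mean reverts to the prior and the variance approaches the kernel's prior value, exactly the behavior a risk-aware planner can exploit.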
The resulting crowd cost map is then merged with a conventional geometric map (derived from a 3D LiDAR) on a common grid. Both layers are weighted (default 1:1) and summed to produce a final cost map. Standard Dijkstra’s algorithm is applied to this map to compute the shortest‑cost path from a start to a goal location. Because the cost map is represented as an 8‑bit grayscale image, the planning step remains lightweight and compatible with existing robot software stacks.
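The merge-and-plan step above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the authors' code: the default 1:1 weighting and the 8-bit grayscale cost representation come from the paper, while the 4-connected grid, the unit step cost, and all function names are assumptions for the example.

```python
import heapq
import numpy as np


def merge_cost_maps(geometric, crowd, w_geo=1.0, w_crowd=1.0):
    """Weighted sum of the geometric and crowd layers (default 1:1),
    clipped to an 8-bit grayscale range as in the paper."""
    merged = w_geo * geometric + w_crowd * crowd
    return np.clip(merged, 0, 255).astype(np.uint8)


def dijkstra(cost, start, goal):
    """Shortest-cost path on a 4-connected grid; entering a cell pays its cost
    plus a unit step cost (the unit cost avoids zero-cost cycles)."""
    h, w = cost.shape
    dist = {start: 0}
    prev = {}
    pq = [(0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        r, c = u
        for v in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = v
            if 0 <= nr < h and 0 <= nc < w:
                nd = d + 1 + int(cost[nr, nc])
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    prev[v] = u
                    heapq.heappush(pq, (nd, v))
    # Walk back from goal to start to recover the path.
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    path.append(start)
    return path[::-1]
```

Because the planner only ever sees the merged scalar grid, swapping the crowd layer in or out requires no change to the planning code itself, which is the compatibility point the paper emphasizes.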
Experimental validation was carried out on a university campus using a custom‑built robot equipped with a 3D LiDAR, two 2D LiDARs, an IMU, and a front‑mounted RGB camera. Two controlled crowd scenarios were created by gathering a small group of students at distinct locations. The VLM correctly identified crowds when they were within close range of the camera, but labeled distant scenes as “free”, reflecting the limited field of view. After GPR interpolation, the crowd cost peaked slightly ahead of the actual crowd location due to the forward‑looking camera perspective, illustrating the importance of aligning perception and localization frames.
Path planning results showed that with equal weighting of geometric and crowd costs, the robot successfully avoided both static obstacles and the crowds, reaching its goal without collisions. When the weighting was skewed heavily toward the geometric layer (geometric : crowd = 9 : 1), the planner effectively ignored crowd costs, resulting in routes that passed through dense groups and demonstrating the sensitivity of the system to weight selection. The authors note that the current implementation treats crowds as quasi‑static; however, by increasing the frequency of VLM queries and GPR updates, the method could be extended to fully dynamic crowd navigation.
Key contributions include: (1) introducing VLMs as a bridge between raw visual data and high‑level semantic labels for crowd detection, (2) leveraging GPR to transform sparse binary labels into a probabilistic, confidence‑aware cost surface, and (3) integrating this abstract layer with traditional geometric maps without altering the underlying path‑planning algorithm.
Limitations are acknowledged: reliance on a forward‑facing camera leads to blind spots behind the robot, and the binary “crowd/free” output discards finer density information that could enable more nuanced navigation strategies. Future work is suggested to incorporate omnidirectional vision, additional sensing modalities such as Wi‑Fi CSI or CO₂ concentration, and to explore reinforcement‑learning policies that can adaptively re‑plan as crowd patterns evolve.
In summary, the paper presents a novel, practical approach for embedding abstract social context—specifically crowd presence—into robot navigation. By combining state‑of‑the‑art multimodal language‑vision models with probabilistic regression, the authors demonstrate that robots can achieve safer, more socially aware motion in environments where human density fluctuates, paving the way for broader deployment of service robots in public spaces.