Autonomous Construction-Site Safety Inspection Using Mobile Robots: A Multilayer VLM-LLM Pipeline


Construction safety inspection remains largely manual, and automated approaches still rely on task-specific datasets that are hard to maintain in fast-changing construction environments because of frequent retraining. Meanwhile, field inspection with robots still depends on human teleoperation and manual reporting, both of which are labor-intensive. This paper aims to connect what a robot sees during autonomous navigation to the safety rules common on construction sites, automatically generating a safety inspection report. To this end, we propose a multilayer framework with two main modules: robotics and AI. On the robotics side, SLAM and autonomous navigation provide repeatable coverage and targeted revisits via waypoints. On the AI side, a Vision Language Model (VLM)-based layer produces scene descriptions; a retrieval component grounds those descriptions in OSHA and site policies; another VLM-based layer assesses the safety situation against the applicable rules; and finally a Large Language Model (LLM) layer generates safety reports from the previous outputs. The framework is validated with a proof-of-concept implementation and evaluated in a lab environment that simulates common hazards across three scenarios. Results show high recall with competitive precision compared to state-of-the-art closed-source models. This paper contributes a transparent, generalizable pipeline that moves beyond black-box models by exposing intermediate artifacts from each layer and keeping the human in the loop. This work provides a foundation for future extensions to additional tasks and settings within and beyond the construction context.


💡 Research Summary

The paper presents a novel end‑to‑end system that combines autonomous mobile robotics with a multilayer vision‑language‑large‑language‑model (VLM‑LLM) pipeline to perform construction‑site safety inspections and automatically generate inspection reports. The authors identify two major shortcomings in existing approaches: (1) manual or tele‑operated robot inspections still require human operators to plan paths, recognize hazards, and write reports; (2) task‑specific computer‑vision models need large, site‑specific datasets and frequent retraining, making them brittle in the rapidly changing environments typical of construction sites. To address these gaps, the proposed framework is divided into two modules.

Module A (Robotics) implements simultaneous localization and mapping (SLAM) using LiDAR and RGB‑Depth sensors, followed by behavior‑tree‑based autonomous navigation. The SLAM component builds a metric map while estimating the robot’s pose; the navigation stack uses a global planner (Dijkstra, interchangeable with A*) and a local planner to generate safe trajectories and revisit way‑points for targeted re‑inspection. This layer provides repeatable coverage without human tele‑operation.
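The global planning step can be illustrated with a minimal grid-based Dijkstra search. This is an illustrative sketch only, not the paper's implementation: the occupancy grid, uniform step cost, and `dijkstra_path` API are assumptions made for the example.

```python
import heapq

def dijkstra_path(grid, start, goal):
    """Shortest path on a 4-connected occupancy grid (0 = free, 1 = occupied).

    Returns a list of (row, col) cells from start to goal, or None if no path.
    """
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    prev = {}
    pq = [(0, start)]
    while pq:
        d, cell = heapq.heappop(pq)
        if cell == goal:
            # Reconstruct the path by walking predecessors back to the start.
            path = [cell]
            while cell in prev:
                cell = prev[cell]
                path.append(cell)
            return path[::-1]
        if d > dist.get(cell, float("inf")):
            continue  # stale priority-queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1  # uniform step cost; adding a heuristic yields A*
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(pq, (nd, (nr, nc)))
    return None
```

The interchangeability the authors mention falls out of the cost function: adding an admissible goal-distance heuristic to the queue priority turns the same search into A* without changing the rest of the planner.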

Module B (AI) consists of four sequential layers.

  1. Scene Description (B1) – a pretrained Vision‑Language Model receives each captured image frame and produces a natural‑language description of the scene, enumerating workers, equipment, personal protective equipment (PPE), and ongoing activities.
  2. Regulation Grounding (B2) – the textual description is used as a query to retrieve relevant OSHA standards and site‑specific safety policies from a vector‑search index. A Large Language Model (LLM) then formats the retrieved clauses and links them to the described entities.
  3. Safety Assessment (B3) – a second VLM takes the combined “description + regulation” input and decides whether each observed condition complies with the applicable rule, outputting a binary violation flag, a confidence score, and a brief justification.
  4. Report Generation (B4) – an LLM aggregates frame‑level assessments, groups violations by type, and composes a structured inspection report that includes cited regulation sections, risk severity, and recommended corrective actions. All intermediate artifacts (maps, descriptions, retrieved clauses, assessment decisions) are logged, enabling a human‑in‑the‑loop verification step and providing transparency often missing from black‑box models.
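The retrieval step in B2 can be sketched with a tiny bag-of-words cosine-similarity ranker over regulation clauses. The paper uses a vector-search index over embeddings; the stdlib-only scoring, the clause records, and the `retrieve_clauses` API below are assumptions made so the example is self-contained.

```python
import math
from collections import Counter

def _vec(text):
    """Naive bag-of-words vector; a real system would use dense embeddings."""
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_clauses(description, clauses, top_k=2):
    """Rank regulation clauses by similarity to a VLM scene description."""
    q = _vec(description)
    scored = [(_cosine(q, _vec(c["text"])), c) for c in clauses]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for score, c in scored[:top_k] if score > 0]
```

The retrieved clauses are then handed, together with the scene description, to the assessment layer (B3), which is what lets the violation flag cite a specific regulation section rather than a generic judgment.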

The authors built a proof‑of‑concept prototype and evaluated it in a laboratory environment that emulated three realistic hazard scenarios: (a) potential fall from height, (b) missing PPE, and (c) unauthorized entry into a restricted zone. For each scenario, they measured recall, precision, accuracy, and F1‑score at the level of individual frame assessments and at the report level. The multilayer pipeline achieved a recall of 92 % and a precision of 78 %, outperforming comparable closed‑source commercial models on the same data. The high recall is attributed to the regulation‑grounding step, which supplies multiple contextual cues that help the VLM identify subtle violations.
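The F1-score follows directly from the reported recall and precision as their harmonic mean; a quick check using the paper's stated 92 % recall and 78 % precision:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Frame-level figures reported in the evaluation:
f1 = f1_score(0.78, 0.92)
print(round(f1, 3))  # → 0.844
```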

Key contributions include:

  • A regulation‑aware perception pipeline that ties visual evidence directly to OSHA clauses, producing traceable, cited outputs.
  • A modular, zero‑shot architecture that can be updated by swapping newer VLM or LLM components without redesigning the whole system.
  • Demonstrated transparency through exposed intermediate results, facilitating auditability and human oversight.
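The modular, swappable architecture the contributions describe can be sketched as layers that depend only on interfaces, with every intermediate artifact logged for human review. The class and method names below are illustrative assumptions, not the authors' API:

```python
from typing import Protocol

class VisionLanguageModel(Protocol):
    def describe(self, image_path: str) -> str: ...
    def assess(self, description: str, clauses: list) -> dict: ...

class LanguageModel(Protocol):
    def write_report(self, assessments: list) -> str: ...

class InspectionPipeline:
    """Layers depend only on the interfaces above, so a newer VLM or LLM
    can be dropped in without redesigning the rest of the system."""

    def __init__(self, vlm: VisionLanguageModel, llm: LanguageModel, retriever):
        self.vlm, self.llm, self.retriever = vlm, llm, retriever
        self.artifacts = []  # intermediate outputs kept for human-in-the-loop review

    def inspect(self, image_paths):
        assessments = []
        for path in image_paths:
            desc = self.vlm.describe(path)            # B1: scene description
            clauses = self.retriever(desc)            # B2: regulation grounding
            verdict = self.vlm.assess(desc, clauses)  # B3: safety assessment
            self.artifacts.append((path, desc, clauses, verdict))
            assessments.append(verdict)
        return self.llm.write_report(assessments)     # B4: report generation
```

Because each layer exchanges plain text and structured records, swapping a component only requires that the replacement satisfy the same small interface.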

Limitations discussed are processing latency (the current pipeline runs at ~2 fps on a GPU workstation), sensitivity of the VLM to lighting and weather variations, and the need to keep the regulation knowledge base up‑to‑date. Future work will explore edge‑computing optimizations, multimodal sensor fusion (e.g., thermal and acoustic cues), and continual learning mechanisms that incorporate operator feedback to improve robustness over time.

In summary, this research advances the state of the art in construction safety automation by integrating autonomous robotics with foundation models in a transparent, extensible framework, paving the way for scalable, low‑maintenance safety monitoring on real construction sites.

