Real-Time 2D LiDAR Object Detection Using Three-Frame RGB Scan Encoding
Indoor service robots need perception that is robust, more privacy-friendly than RGB video, and feasible on embedded hardware. We present a camera-free 2D LiDAR object detection pipeline that encodes short-term temporal context by stacking three consecutive scans as RGB channels, yielding a compact YOLOv8n input without occupancy-grid construction while preserving angular structure and motion cues. Evaluated in Webots across 160 randomized indoor scenarios with strict scenario-level holdout, the method achieves 98.4% mAP@0.5 (0.778 mAP@0.5:0.95) with 94.9% precision and 94.7% recall on four object classes. On a Raspberry Pi 5, it runs in real time with a mean post-warm-up end-to-end latency of 47.8 ms per frame, including scan encoding and postprocessing. Relative to a closely related occupancy-grid LiDAR-YOLO pipeline reported on the same platform, the proposed representation is associated with substantially lower reported end-to-end latency. Although results are simulation-based, they suggest that lightweight temporal encoding can enable accurate and real-time LiDAR-only detection for embedded indoor robotics without capturing RGB appearance.
💡 Research Summary
The paper addresses the need for a perception system for indoor service robots that is robust, privacy‑preserving, and capable of running on low‑cost embedded hardware. While RGB cameras provide rich appearance information, they raise privacy concerns and are sensitive to lighting conditions. 2‑D LiDAR sensors, on the other hand, deliver geometric data without capturing visual appearance, but their sparse measurements have traditionally limited multi‑class object detection, especially when the processing pipeline requires dense occupancy‑grid construction that is computationally expensive.
The authors propose a novel, lightweight pipeline that directly converts three consecutive 2‑D LiDAR scans into a compact RGB‑style tensor and feeds it to a tiny YOLOv8n detector. Each scan is rasterized into a binary image of size 64 (range bins) × 360 (angular bins, 1° per column). To satisfy the stride requirements of the YOLO backbone, 24 zero‑columns are padded, yielding a 64 × 384 image. Three successive rasters are stacked as the red, green, and blue channels, forming a 64 × 384 × 3 tensor. This representation preserves the ordered angular structure of the original scan, injects short‑term motion cues through the colour channels, and requires no odometry alignment or scan matching.
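The encoding described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the maximum sensor range `MAX_RANGE` and the handling of invalid returns are assumptions, since the paper excerpt does not specify them.

```python
import numpy as np

RANGE_BINS = 64     # radial bins (image rows)
ANGLE_BINS = 360    # one column per degree
PAD_COLS = 24       # zero-padding to reach a stride-friendly width of 384
MAX_RANGE = 8.0     # assumed maximum sensor range in metres (not stated in the paper)

def rasterize(ranges):
    """Convert one 360-beam scan (metres) into a 64x384 binary raster."""
    img = np.zeros((RANGE_BINS, ANGLE_BINS + PAD_COLS), dtype=np.uint8)
    valid = np.isfinite(ranges) & (ranges > 0) & (ranges < MAX_RANGE)
    rows = np.clip((ranges[valid] / MAX_RANGE * RANGE_BINS).astype(int),
                   0, RANGE_BINS - 1)
    cols = np.nonzero(valid)[0]   # column index == beam angle in degrees
    img[rows, cols] = 1
    return img

def stack_three(scan_t2, scan_t1, scan_t0):
    """Stack rasters of scans t-2, t-1, t as R, G, B channels -> 64x384x3."""
    return np.stack([rasterize(scan_t2), rasterize(scan_t1),
                     rasterize(scan_t0)], axis=-1)
```

Because each raster keeps the scan's native angular ordering, a static object occupies the same pixels in all three channels (rendered grey/white), while a moving object leaves colour-separated traces, which is the motion cue the detector exploits.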
The method is evaluated entirely in simulation using the Webots environment. The authors generate 160 randomized indoor scenarios, each with 90 predefined robot waypoints, resulting in 14,400 episodes and a total of 768,897 labeled samples. Objects belong to four classes (chair, box, desk, doorframe) and are randomly translated, rotated, and scaled to ensure diversity. Ground‑truth bounding boxes are automatically projected from the simulator’s object metadata into the raster coordinate system, with the middle scan (t‑1) used as the target label.
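Projecting object metadata into the raster coordinate system amounts to mapping an object's polar extent onto image rows and columns. The sketch below is a hedged reconstruction under the same assumptions as above (a fixed maximum range, one column per degree); the function name, inputs, and `max_range` are illustrative, not taken from the paper.

```python
def polar_box_to_yolo(r_min, r_max, a_min_deg, a_max_deg,
                      max_range=8.0, width=384, height=64):
    """Map an object's polar extent (metres, degrees) to a YOLO-normalized
    (cx, cy, bw, bh) box in the 64x384 raster. Assumes 1 degree == 1 column
    and linear range binning; max_range is an illustrative assumption."""
    x0, x1 = a_min_deg, a_max_deg            # columns
    y0 = r_min / max_range * height          # rows
    y1 = r_max / max_range * height
    cx = (x0 + x1) / 2 / width
    cy = (y0 + y1) / 2 / height
    return cx, cy, (x1 - x0) / width, (y1 - y0) / height
```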
Training is performed on an NVIDIA H100 GPU using the Ultralytics YOLOv8 implementation (v8.3.152). The YOLOv8n model (≈3 M parameters) is trained for 120 epochs with a batch size of 1,024, the AdamW optimizer (initial LR = 0.001), and cosine learning‑rate scheduling. Data augmentation is deliberately minimal: a left‑right flip along the angular axis (50 % probability) and an up‑down flip along the range axis (20 % probability), avoiding heavy image‑based augmentations that could break the geometric consistency between simulation and real world.
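The reported hyperparameters map naturally onto standard Ultralytics train arguments. The fragment below is a hedged reconstruction, not the authors' configuration: the dataset file `lidar_rgb.yaml` is hypothetical, and the disabling of mosaic augmentation is an assumption consistent with the stated minimal-augmentation policy.

```yaml
# Hedged reconstruction of the reported training setup (Ultralytics train args).
model: yolov8n.yaml      # nano variant, ~3 M parameters
data: lidar_rgb.yaml     # hypothetical dataset config file
epochs: 120
batch: 1024
optimizer: AdamW
lr0: 0.001               # initial learning rate
cos_lr: true             # cosine learning-rate schedule
fliplr: 0.5              # left-right flip along the angular axis
flipud: 0.2              # up-down flip along the range axis
mosaic: 0.0              # assumed off, per the minimal-augmentation policy
```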
On a strict scenario‑level hold‑out test set (5 % of scenarios unseen during training), the model achieves an overall mAP@0.5 of 0.984 and mAP@0.5:0.95 of 0.778. Per‑class precision and recall range from 0.91 to 0.99, with the most frequent errors occurring between desk and doorframe due to similar 2‑D silhouettes from certain viewpoints. Background false detections remain below 1 %.
Real‑time performance is measured on three platforms. On a Raspberry Pi 5 (Broadcom BCM2712, quad‑core Cortex‑A76 @ 2.4 GHz), the complete pipeline—including rasterization, three‑frame stacking, YOLOv8n inference, and non‑maximum suppression—averages 47.8 ms per frame after a warm‑up run, comfortably meeting the target of sub‑50 ms (≈20 Hz) for indoor navigation loops. The same pipeline runs in 22 ms on a MacBook Air M2 and 112 ms on a Raspberry Pi 3 (using TorchScript). Compared to a recent occupancy‑grid based LiDAR‑YOLO approach evaluated on the same hardware, the proposed representation reduces preprocessing overhead and input size dramatically, leading to a noticeable latency reduction.
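The latency protocol above (end-to-end timing with the first run discarded as warm-up) can be mirrored with a short harness. This is a generic sketch, not the authors' benchmark code; `pipeline` stands in for the full rasterize-stack-infer-NMS chain.

```python
import time
import statistics

def benchmark_ms(pipeline, inputs, warmup=1):
    """Mean end-to-end latency in ms per frame, discarding the first
    `warmup` runs so that one-off JIT/cache costs are excluded."""
    for x in inputs[:warmup]:
        pipeline(x)                            # warm-up runs, not timed
    times = []
    for x in inputs[warmup:]:
        t0 = time.perf_counter()
        pipeline(x)                            # full per-frame pipeline
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.mean(times)
```

Reporting the post-warm-up mean matters on small ARM boards, where the first inference can be several times slower than steady state due to model loading and memory-allocator warm-up.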
The authors acknowledge several limitations. All data are synthetic; real‑world LiDAR noise, surface reflectivity variations, and dynamic obstacles have not been tested. The three‑frame window captures only short‑term motion, which may be insufficient for fast‑moving objects or for leveraging longer temporal context. Future work could involve real‑robot deployments to assess domain transfer, exploration of adaptive frame windows, or integration of temporal neural architectures (e.g., Conv‑LSTMs or Transformers) to exploit longer sequences while preserving the lightweight nature of the pipeline.
In summary, the paper demonstrates that a simple three‑frame RGB encoding of 2‑D LiDAR data enables high‑accuracy, multi‑class object detection with a tiny YOLO model, achieving real‑time performance on low‑cost embedded platforms without the need for dense occupancy‑grid construction. This approach offers a promising path toward privacy‑friendly, perception‑only LiDAR solutions for indoor service robots.