Neural Sentinel: Unified Vision Language Model (VLM) for License Plate Recognition with Human-in-the-Loop Continual Learning
Traditional Automatic License Plate Recognition (ALPR) systems employ multi-stage pipelines consisting of object detection networks followed by separate Optical Character Recognition (OCR) modules, introducing compounding errors, increased latency, and architectural complexity. This research presents Neural Sentinel, a novel unified approach that uses a Vision Language Model (VLM) to perform license plate recognition, state classification, and vehicle attribute extraction in a single forward pass. Our primary contribution is demonstrating that a fine-tuned PaliGemma 3B model, adapted via Low-Rank Adaptation (LoRA), can simultaneously answer multiple visual questions about vehicle images, achieving 92.3% plate recognition accuracy, an improvement of 14.1 percentage points over the EasyOCR baseline and 9.9 percentage points over the PaddleOCR baseline. We introduce a Human-in-the-Loop (HITL) continual learning framework that incorporates user corrections while preventing catastrophic forgetting through experience replay, maintaining a 70:30 ratio of original training data to correction samples. The system achieves a mean inference latency of 152 ms with an Expected Calibration Error (ECE) of 0.048, indicating well-calibrated confidence estimates. Additionally, the VLM-first architecture enables zero-shot generalization to auxiliary tasks including vehicle color detection (89%), seatbelt detection (82%), and occupancy counting (78%) without task-specific training. Through extensive experimentation on real-world toll plaza imagery, we demonstrate that unified vision-language approaches represent a paradigm shift in ALPR systems, offering superior accuracy, reduced architectural complexity, and emergent multi-task capabilities that traditional pipeline approaches cannot achieve.
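To make the calibration figure concrete, the following is a minimal sketch of how an Expected Calibration Error is commonly computed from per-prediction confidences and correctness flags. The function name, the choice of ten equal-width bins, and the toy inputs are assumptions for illustration; the abstract does not state which binning scheme the authors used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    Assumption: equal-width bins over [0, 1]; the paper's binning is not specified.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Toy example (not data from the paper): six plate reads with model confidences.
toy_conf = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
toy_ok = [True, True, True, False, True, False]
toy_ece = expected_calibration_error(toy_conf, toy_ok)
```

A lower value means the stated confidence tracks the observed accuracy more closely, which is what makes a confidence threshold usable for routing uncertain reads to human reviewers.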
💡 Research Summary
The paper introduces Neural Sentinel, a unified Vision‑Language Model (VLM) framework that replaces the traditional multi‑stage pipeline used for Automatic License Plate Recognition (ALPR). Conventional systems cascade separate modules—vehicle detection, plate localization, character segmentation, and OCR—so that errors propagate, computational cost multiplies, and extending functionality requires additional bespoke components. Neural Sentinel instead treats ALPR as a multi‑task visual question‑answering problem: a single forward pass of a fine‑tuned PaliGemma 3B model receives an image and natural‑language prompts (e.g., “What is the license‑plate number?”) and outputs textual answers for plate number, state, make/model, and color.
Key technical contributions are:
- Model selection and adaptation – PaliGemma 3B combines a SigLIP Vision Transformer (ViT-L/14) with the Gemma language decoder, offering strong scene-text capabilities while remaining lightweight enough for real-time inference (<200 ms on a consumer GPU). The authors apply Low-Rank Adaptation (LoRA) to all attention projection and feed-forward layers with rank r = 16 and scaling factor α = 32, training only ~0.27% of the 3 B parameters (≈8 M). This preserves the broad visual-language knowledge acquired during pre-training while specializing the model for license-plate character patterns.
- Human-in-the-Loop continual learning (HITL-CL) – Operational staff can correct low-confidence or erroneous outputs, and the corrections are stored in an experience-replay buffer. During incremental updates, original training data and buffered corrections are mixed in a 70:30 ratio, which mitigates catastrophic forgetting and lets the model adapt to distribution shifts without full retraining. Updates are trigger-based rather than batch-scheduled, keeping latency low for production deployment.
- Comprehensive evaluation – On 12,000 real-world toll-plaza images covering diverse lighting, angles, and weather, Neural Sentinel achieves 92.3% plate-recognition accuracy, a gain of 14.1 percentage points over EasyOCR (78.2%) and 9.9 percentage points over PaddleOCR (82.4%). Mean inference latency is 152 ms, a reported 43% reduction versus a conventional pipeline (~260 ms). Calibration analysis yields an Expected Calibration Error of 0.048, indicating reliable confidence scores. Zero-shot tests on auxiliary tasks (vehicle color at 89% accuracy, seat-belt detection at 82%, and occupancy counting at 78%) demonstrate emergent capabilities without task-specific fine-tuning.
- Analysis of limitations and future work – The 3 B model fits within a 12 GB GPU memory budget but struggles with ultra-high-resolution frames and extreme weather conditions. The replay buffer size is bounded, so long-term drift mitigation will require smarter sample selection or hierarchical memory. Future directions include scaling to PaliGemma 7B, applying multimodal compression to further reduce latency, and integrating automated labeling pipelines (e.g., RLHF-style feedback) to streamline the HITL loop.
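The experience-replay mixing in the Human-in-the-Loop contribution can be sketched as below. This is an illustrative sketch, not the authors' implementation: it follows the abstract's wording of the ratio (70% original training data, 30% correction samples), and the function name `build_update_batch`, the sampling-with-replacement fallback for a small buffer, and the toy records are all assumptions.

```python
import random


def build_update_batch(original_pool, correction_buffer, batch_size,
                       original_fraction=0.7, seed=None):
    """Draw one incremental-update batch mixing original data and corrections.

    Assumption: the 70:30 ratio is applied per batch; the source only states
    the overall ratio, not where in the training loop it is enforced.
    """
    rng = random.Random(seed)
    n_original = round(batch_size * original_fraction)
    n_correction = batch_size - n_original

    if n_original > len(original_pool):
        raise ValueError("original pool is smaller than the requested share")
    if n_correction > 0 and not correction_buffer:
        raise ValueError("correction buffer is empty")

    # Original samples are drawn without replacement to replay varied examples.
    originals = rng.sample(original_pool, n_original)

    # Hypothetical fallback: early in deployment the buffer may hold fewer
    # corrections than needed, so sample with replacement in that case.
    if n_correction <= len(correction_buffer):
        corrections = rng.sample(correction_buffer, n_correction)
    else:
        corrections = rng.choices(correction_buffer, k=n_correction)

    batch = originals + corrections
    rng.shuffle(batch)  # avoid ordering the batch by data source
    return batch


# Toy records (source tag, index) standing in for (image, label) pairs.
toy_originals = [("original", i) for i in range(100)]
toy_corrections = [("correction", i) for i in range(5)]
toy_batch = build_update_batch(toy_originals, toy_corrections, batch_size=20, seed=0)
```

Keeping the original data as the majority of every update is what guards against catastrophic forgetting, while the correction share determines how quickly the model absorbs operator feedback.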
In summary, Neural Sentinel validates that a VLM‑first architecture can deliver superior accuracy, lower latency, and built‑in multi‑task flexibility for ALPR, marking a paradigm shift from specialized cascades to a single, language‑driven visual reasoning engine. This approach promises easier expansion to related traffic‑monitoring tasks and smoother integration into smart‑city infrastructures.