DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications

Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.


💡 Research Summary

The paper tackles a fundamental challenge in civil‑engineering computer vision: the scarcity of annotated images for specialized tasks such as tunnel crack detection or construction‑site personal‑protective‑equipment (PPE) monitoring. To bridge the gap between data‑efficiency and real‑time performance, the authors propose DINO‑YOLO, a hybrid architecture that fuses the speed‑optimized YOLOv12 detector with the representation power of the self‑supervised vision transformer DINOv3.

Architecture

YOLOv12 serves as the backbone for object localization and classification. DINOv3, pre‑trained on large unlabelled image collections with a self‑distillation objective, provides two sets of feature maps that are injected into the YOLO pipeline at distinct stages:

  1. P0 – Input preprocessing – The first DINOv3 block processes the raw image and produces a low‑resolution feature map. This map is concatenated with the original RGB tensor before the first convolutional layer of YOLO, effectively giving the detector a richer, globally‑aware representation from the very beginning.

  2. P3 – Mid‑backbone enhancement – After the third convolutional stage (C3 module) of YOLO, a second DINOv3 block outputs an attention‑enhanced feature map. After channel alignment, this map is added to the YOLO feature via a residual connection, allowing the detector to refine its intermediate representations with transformer‑style context.

The authors explore three integration strategies: Dual‑P0P3 (both points), Triple‑Integration (adding a third insertion point for the smallest YOLO scale), and a baseline with no DINO insertion.
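The two fusion points described above can be sketched in a few lines. This is a minimal NumPy stand‑in, not the paper's implementation: the function names, tensor shapes, nearest‑neighbour upsampling, and the 1×1 projection matrix `proj` are all illustrative assumptions.

```python
import numpy as np

def fuse_p0(rgb, dino_feat):
    """P0 fusion: upsample the low-resolution DINO map to the input
    resolution, then concatenate it with the RGB tensor along channels.
    rgb: (3, H, W); dino_feat: (C, H//k, W//k)."""
    c, h, w = dino_feat.shape
    sh, sw = rgb.shape[1] // h, rgb.shape[2] // w
    up = dino_feat.repeat(sh, axis=1).repeat(sw, axis=2)  # nearest-neighbour upsample
    return np.concatenate([rgb, up], axis=0)              # (3 + C, H, W)

def fuse_p3(yolo_feat, dino_feat, proj):
    """P3 fusion: a 1x1-convolution-like channel projection aligns DINO
    channels to YOLO channels, then a residual add refines the map.
    yolo_feat: (Cy, H, W); dino_feat: (Cd, H, W); proj: (Cy, Cd)."""
    aligned = np.einsum('oc,chw->ohw', proj, dino_feat)   # channel alignment
    return yolo_feat + aligned                            # residual connection
```

In the actual model the projection would be a learned 1×1 convolution and the upsampling an interpolation layer; the sketch only shows where the concatenation and the residual addition sit in the pipeline.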

Experimental Protocol

Three datasets of increasing size are used:

  • Tunnel Segment Crack – 648 images, 5 crack classes.
  • Construction PPE – 1,000 images, 7 safety‑gear classes.
  • KITTI – 7,000 images, 3 vehicle/pedestrian classes (used as a larger‑scale benchmark).

YOLOv12 is evaluated at five scales (N, S, M, L, X). DINOv3 is instantiated in nine variants differing in token count (16, 32, 64), depth (6 or 12 layers), and attention heads (4 or 8). All models are trained on an NVIDIA RTX 5090, and performance is measured by mAP@0.5, frames‑per‑second (FPS), and inference latency (ms).
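Under the mAP@0.5 metric used throughout, a detection counts as a true positive when its intersection‑over‑union (IoU) with a ground‑truth box reaches 0.5. A minimal sketch of that matching criterion (box format and function names are illustrative, not from the paper's code):

```python
def iou(box_a, box_b):
    """Intersection-over-union for boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, thresh=0.5):
    """Prediction matches a ground-truth box at the mAP@0.5 criterion."""
    return iou(pred, gt) >= thresh
```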

Key Results

| Scale | Integration | Dataset | ΔmAP@0.5 | FPS | Latency |
|---|---|---|---|---|---|
| Medium (M) | Dual‑P0P3 (DINO‑Base, 32‑token, 12‑layer, 8‑head) | Tunnel Crack | +12.4 % | 30 | 21 ms |
| Medium (M) | Dual‑P0P3 | Construction PPE | +13.7 % | 32 | 23 ms |
| Medium (M) | Dual‑P0P3 | KITTI | +88.6 % | 30 | 25 ms |
| Small (S) | Triple‑Integration | Tunnel Crack | +9.8 % | 35 | 28 ms |
| Large (L) | Dual‑P0P3 | KITTI | +5.2 % | 27 | 30 ms |

The Medium‑scale model with Dual‑P0P3 integration achieves the highest overall mAP@0.5 of 55.77 %, while the Small‑scale model requires a third insertion point to reach 53.63 %. Across all DINO variants, the DINO‑Base configuration (32 tokens, 12 layers, 8 heads) consistently yields the best trade‑off between accuracy and compute.

Technical Insights

  1. Global Context Benefits Low‑Data Regimes – DINOv3’s self‑attention captures scene‑wide relationships that are absent in pure convolutional pipelines. This is especially valuable for crack detection, where the target occupies a tiny fraction of the image and is often obscured by complex backgrounds.

  2. Two‑Stage Feature Fusion – Injecting transformer features both before any convolution (P0) and after an intermediate convolutional block (P3) allows the detector to benefit from global cues early on while still preserving the fine‑grained spatial detail learned by YOLO’s later layers.

  3. Scale‑Specific Design – The Medium backbone offers enough capacity to absorb the additional transformer features without saturating, whereas the Large backbone suffers diminishing returns due to redundant capacity. The Small backbone, being capacity‑constrained, needs an extra fusion point to fully exploit the transformer information.

  4. Real‑Time Feasibility – Although DINO‑YOLO introduces a 2–4× inference‑time overhead (21–33 ms latency vs. 8–16 ms for vanilla YOLOv12), the resulting 30–47 FPS on an RTX 5090 remains within the real‑time threshold required for on‑site inspection drones, robotic crawlers, or edge‑mounted cameras.
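The latency and frame‑rate figures above are two views of the same quantity; a simple unit conversion (not from the paper's code) confirms that the reported 21–33 ms range corresponds to roughly 30–47 FPS:

```python
def fps_from_latency(latency_ms):
    """Convert per-frame inference latency in milliseconds to throughput in FPS."""
    return 1000.0 / latency_ms

fast = fps_from_latency(21)  # ~47.6 FPS at the low end of the latency range
slow = fps_from_latency(33)  # ~30.3 FPS at the high end
```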

Limitations & Future Work

  • Memory Footprint – The added transformer blocks raise GPU memory consumption by ~1.8×, limiting deployment on low‑power edge devices. Model compression (pruning, quantization) and knowledge‑distillation to a pure CNN student are promising avenues.
  • Domain‑Specific Pre‑Training – The current DINOv3 weights are pre‑trained on generic, web‑scale image corpora. Pre‑training on domain‑specific unlabelled footage (e.g., construction‑site video streams) could further improve transferability.
  • Broader Applicability – While the paper focuses on civil‑engineering datasets (<10 K images), the methodology is readily extensible to other data‑scarce fields such as medical imaging, precision agriculture, or underwater inspection.

Conclusion

DINO‑YOLO demonstrates that self‑supervised vision transformers can be seamlessly integrated into a high‑speed detector to achieve substantial accuracy gains in data‑constrained civil‑engineering scenarios, without sacrificing real‑time performance. The systematic ablation across YOLO scales and DINO variants provides clear guidelines: use a Medium‑scale YOLO with Dual‑P0P3 fusion for the best overall balance; adopt Triple‑Integration for Small‑scale deployments where hardware resources are limited. By delivering up to 88.6 % relative mAP improvement on KITTI and double‑digit gains on specialized crack and PPE datasets, the work establishes a new state‑of‑the‑art baseline for practical, on‑site infrastructure inspection and construction‑site safety monitoring. Future research will focus on model compression, domain‑specific self‑supervised pre‑training, and extending the framework to other low‑annotation domains.

