A Synthetic Dataset for Manometry Recognition in Robotic Applications

This paper addresses the challenges of data scarcity and high acquisition costs in training robust object detection models for complex industrial environments, such as offshore oil platforms. The difficulty and risk of data collection in these hazardous settings often limit the development of autonomous inspection systems. To mitigate this issue, we propose a hybrid data synthesis pipeline that integrates procedural rendering and AI-driven video generation. The approach uses BlenderProc to produce photorealistic images with domain randomization and NVIDIA’s Cosmos-Predict2 to generate physically consistent video sequences with temporal variation. A YOLO-based detector trained on a composite dataset combining real and synthetic data outperformed models trained solely on real images, with a 1:1 ratio of real to synthetic samples achieving the highest accuracy. The results demonstrate that synthetic data generation is a viable, cost-effective, and safe strategy for developing reliable perception systems in safety-critical and resource-constrained industrial applications.


💡 Research Summary

The paper tackles the chronic problem of data scarcity in training robust object‑detection models for hazardous industrial environments such as offshore oil platforms, where collecting and annotating real images of safety‑critical assets (e.g., analog pressure gauges or manometers) is expensive, time‑consuming, and poses safety risks to personnel. To address this, the authors propose a hybrid synthetic‑data generation pipeline that combines two complementary approaches: (1) procedural rendering with BlenderProc and (2) AI‑driven video synthesis with NVIDIA’s Cosmos‑Predict2, orchestrated through ComfyUI.

BlenderProc is used to create photorealistic still images from 3D CAD models of manometers. The pipeline applies extensive domain randomization—random backgrounds drawn from industrial textures and photographs, varied illumination, diverse camera poses, and post‑processing effects such as sensor noise, blur, and chromatic aberration. Because the rendering is fully synthetic, pixel‑perfect bounding boxes and segmentation masks are generated automatically, eliminating manual labeling costs.
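The randomization axes listed above can be sketched as a simple per-render parameter sampler. This is an illustrative, pure-Python stand-in rather than the paper's actual BlenderProc code: every parameter name and numeric range below is an assumption, and in the real pipeline these values would be fed to BlenderProc's camera, lighting, and post-processing APIs.

```python
import random

def sample_render_config(backgrounds, seed=None):
    """Sample one randomized render configuration, mirroring the
    domain-randomization axes described above. All parameter names
    and ranges here are illustrative, not taken from the paper."""
    rng = random.Random(seed)
    return {
        "background": rng.choice(backgrounds),            # industrial texture or photo
        "light_intensity": rng.uniform(100.0, 1000.0),    # arbitrary illustrative range
        "light_color_temp": rng.uniform(3000.0, 6500.0),  # kelvin
        "camera_distance": rng.uniform(0.3, 2.0),         # meters from the gauge
        "camera_azimuth": rng.uniform(0.0, 360.0),        # degrees
        "camera_elevation": rng.uniform(-30.0, 60.0),     # degrees
        "sensor_noise_std": rng.uniform(0.0, 0.02),       # post-processing effects
        "blur_radius_px": rng.uniform(0.0, 2.0),
        "chromatic_aberration": rng.uniform(0.0, 0.005),
    }

# One independent configuration per render decorrelates background,
# lighting, and viewpoint -- the core idea of domain randomization.
configs = [sample_render_config(["rusty_pipe.jpg", "deck_plate.jpg"], seed=i)
           for i in range(1000)]
```

Seeding each sample makes the dataset reproducible, which helps when comparing detector runs across regenerated synthetic sets.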

Cosmos‑Predict2, a world‑foundation model specialized for “physical AI,” is then employed to extend short real video clips into longer, temporally consistent sequences. Using ComfyUI workflows, the model performs relighting, viewpoint changes, motion blur, occlusion handling, and realistic jitter, thereby injecting temporal diversity that static renders cannot provide. Pseudo‑labels are propagated across frames via tracking, filtered by confidence thresholds, and a subset is audited by human experts to keep label noise low.
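The confidence-filtering and audit step for propagated pseudo-labels can be illustrated with a minimal sketch. The detection fields (`conf`, `needs_audit`), the threshold value, and the deterministic audit-sampling rule are all assumptions for illustration; the paper does not specify the tracker internals or its exact thresholds.

```python
def filter_pseudo_labels(tracked_detections, conf_threshold=0.8, audit_fraction=0.1):
    """Keep only propagated detections whose tracking confidence clears the
    threshold, and flag a deterministic subset for human audit.
    Field names and defaults are illustrative, not from the paper."""
    kept = [d for d in tracked_detections if d["conf"] >= conf_threshold]
    # Flag roughly `audit_fraction` of the kept labels for expert review.
    audit_every = max(1, int(round(1 / audit_fraction))) if audit_fraction > 0 else 0
    for i, d in enumerate(kept):
        d["needs_audit"] = bool(audit_every) and (i % audit_every == 0)
    return kept
```

Filtering before auditing keeps the human workload proportional to the retained labels, so label noise stays low without reviewing every frame.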

The authors collected 2,500 real images of manometers, annotated them with a hybrid manual (CVAT) and semi‑automatic (SAM2) workflow, and built several synthetic datasets. They experimented with three real‑to‑synthetic ratios: (i) real‑only, (ii) 1:1 (2,500 real + 2,500 synthetic), and (iii) 1:3 (2,500 real + 7,500 synthetic). Within the synthetic portion, a baseline split of 70 % BlenderProc renders and 30 % Cosmos‑generated video frames was used. All configurations were trained on the same YOLO‑based detector (identical architecture, hyper‑parameters, and training schedule) and evaluated on a held‑out set of real images using mean Average Precision (mAP) across IoU thresholds 0.5–0.95, recall, and average recall (AR).
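The composition of each configuration follows directly from the counts above; a small helper makes the arithmetic explicit. The 70/30 BlenderProc/Cosmos split is applied within the synthetic portion, matching the baseline described; the function itself is illustrative, not from the paper.

```python
def dataset_composition(n_real, ratio, blender_frac=0.7):
    """Image counts for a real:synthetic mix of `1:ratio`, with the
    synthetic portion split 70/30 between BlenderProc renders and
    Cosmos-generated video frames (per the baseline above)."""
    n_syn = int(n_real * ratio)
    n_blender = int(n_syn * blender_frac)
    return {
        "real": n_real,
        "synthetic_blenderproc": n_blender,
        "synthetic_cosmos": n_syn - n_blender,
        "total": n_real + n_syn,
    }

for r in (0, 1, 3):                       # real-only, 1:1, 1:3
    print(dataset_composition(2500, r))   # totals: 2,500 / 5,000 / 10,000
```

For the 1:1 mix this yields 1,750 BlenderProc and 750 Cosmos images alongside the 2,500 real ones; for 1:3, 5,250 and 2,250 respectively.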

Results show that the 1:1 mixed dataset achieved the highest mAP of 0.962, surpassing the real‑only baseline (0.936) and the 1:3 mix (0.928). The 1:1 mix also delivered the best recall (0.972) and AR (0.972). Training dynamics revealed that the 1:1 configuration converged faster, reaching high precision within the first few epochs, while the 1:3 mix showed slower early learning and plateaued near the real‑only performance. The 0.5:0.5 scenario (half the real images, half synthetic) still outperformed a naïve halving of the real dataset, indicating that synthetic augmentation can partially compensate for reduced real data.

From a computational standpoint, rendering 1,000 BlenderProc images required ~13 minutes (≈77 imgs/min), whereas generating a 15‑second video with Cosmos‑Predict2 took ~8 minutes, yielding ~300 frames/min after decomposition. Training time scaled with dataset size: the 1:1 mix trained in 1 h 54 m, compared to 3 h 18 m for real‑only, demonstrating that synthetic data can also reduce overall development time.
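The rendering throughput quoted above is simple arithmetic; a one-line helper reproduces the ≈77 imgs/min figure (the helper itself is illustrative, not part of the paper's tooling).

```python
def rate_per_min(count, minutes):
    """Throughput in items per minute."""
    return count / minutes

# BlenderProc: 1,000 still renders in ~13 minutes
blender_rate = rate_per_min(1000, 13)  # ~76.9, i.e. the ~77 imgs/min quoted above
```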

Key insights: (1) Procedural rendering provides precise, controllable annotations, while foundation‑model video synthesis adds realistic temporal variations, together narrowing the sim‑to‑real gap. (2) A balanced real‑to‑synthetic ratio (≈1:1) maximizes both accuracy and training efficiency; excessive synthetic data (1:3) yields diminishing returns and higher compute cost. (3) Synthetic data is a cost‑effective, safe alternative for building large, diverse training sets in safety‑critical industrial robotics.

Limitations include evaluation on a single asset class and a fixed camera setup, focus solely on 2‑D detection (no 6‑D pose or OCR), potential artifacts in AI‑generated video frames, and lack of benchmarking on embedded or edge‑computing platforms. Future work will extend the pipeline to multiple asset types (valves, digital gauges, flowmeters, placards), explore richer perception tasks (instance segmentation, 6‑D pose estimation, OCR of gauge readings), incorporate multimodal sensors, and assess real‑time performance on robot‑mounted hardware.

