Learning to Detect Baked Goods with Limited Supervision
Monitoring leftover products provides valuable insights that can be used to optimize future production. This is especially important for German bakeries because freshly baked goods have a very short shelf life. Automating this process can reduce labor costs, improve accuracy, and streamline operations. We propose automating it with an object detection model that identifies baked goods from images. However, the large diversity of German baked goods makes fully supervised training prohibitively expensive and limits scalability. Although open-vocabulary detectors (e.g., OWLv2, Grounding DINO) offer flexibility, we demonstrate that they are insufficient for our task. While motivated by bakeries, our work addresses the broader challenge of deploying computer vision in industries where tasks are specialized and annotated datasets are scarce. We compile dataset splits with varying supervision levels, covering 19 classes of baked goods. We propose two training workflows for an object detection model with limited supervision. First, we combine OWLv2 and Grounding DINO localization with image-level supervision to train the model in a weakly supervised manner. Second, we improve viewpoint robustness by fine-tuning on video frames annotated using Segment Anything 2 as a pseudo-label propagation model. Using these workflows, we train YOLOv11 for our detection task due to its favorable speed-accuracy trade-off. Relying solely on image-level supervision, the model achieves a mean Average Precision (mAP) of 0.91. Fine-tuning with pseudo-labels raises model performance by 19.3% under non-ideal deployment conditions. Combining these workflows yields a model that surpasses our fully supervised baseline under non-ideal deployment conditions, despite relying only on image-level supervision.
💡 Research Summary
The paper addresses the practical problem of automatically detecting and counting leftover baked goods in German bakeries, where the short shelf‑life of fresh products makes waste reduction economically and environmentally important. Fully supervised object detection is infeasible because the variety of baked goods is large and manual annotation is costly. To overcome this, the authors propose two complementary limited‑supervision training pipelines that together achieve performance comparable to, and in some cases surpassing, a strong fully‑supervised baseline.
The first pipeline leverages open‑vocabulary detectors—OWLv2 and Grounding DINO—as zero‑shot localizers. Because the bakery’s “single‑class” image set (C train) contains only one product type per image, image‑level class labels are sufficient. The detectors are prompted with the generic term “baked good” rather than specific product names, which are not reliable visual cues. Raw predictions tend to over‑detect; therefore, a four‑step post‑processing chain (background filter, duplicate filter, crowd filter, nested filter) prunes spurious boxes and yields a clean set of pseudo‑bounding boxes. These automatically generated boxes serve as supervision for training a YOLOv11 detector. Remarkably, training only with image‑level labels and the weakly generated boxes yields a mean Average Precision (mAP) of 0.91 on a realistic deployment test set (D test).
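The four-step post-processing chain can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the box format (x1, y1, x2, y2 in pixels) and all thresholds (`max_rel_area`, `dup_iou`, `max_nested`) are assumptions chosen for the example, not the authors' values.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def contains(outer, inner):
    """True if `inner` lies fully inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def filter_pseudo_boxes(boxes, img_w, img_h,
                        max_rel_area=0.9,  # background filter threshold (assumed)
                        dup_iou=0.8,       # duplicate filter threshold (assumed)
                        max_nested=3):     # crowd filter threshold (assumed)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    # 1) Background filter: drop boxes covering almost the whole image.
    boxes = [b for b in boxes if area(b) < max_rel_area * img_w * img_h]
    # 2) Duplicate filter: keep only the first of any heavily overlapping pair.
    kept = []
    for b in boxes:
        if all(iou(b, k) < dup_iou for k in kept):
            kept.append(b)
    # 3) Crowd filter: drop a box that encloses several smaller boxes.
    kept = [b for b in kept
            if sum(contains(b, o) for o in kept if o is not b) < max_nested]
    # 4) Nested filter: drop a box fully contained in another surviving box.
    kept = [b for b in kept
            if not any(contains(o, b) for o in kept if o is not b)]
    return kept
```

Running the chain on a toy image removes a near-full-image box, a near-duplicate, a "crowd" box enclosing three products, and a nested box, leaving one box per item.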
The second pipeline targets viewpoint robustness, a critical factor when the camera angle deviates from the ideal top‑down view used during data collection. The authors record videos (V train) where the camera gradually tilts, then extract frames at 4 fps. Only the first frame of each video is manually annotated; the remaining frames are automatically labeled using Segment Anything 2 (SAM 2) as a pseudo‑label propagation model. This approach reduces manual annotation effort by more than 96 %. The pseudo‑labeled frames are then used to fine‑tune the YOLOv11 model. Under non‑ideal deployment conditions (varying lighting and slanted viewpoints), this fine‑tuned model improves performance by 19.3 % relative to the weakly supervised baseline.
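Two bookkeeping steps of this pipeline can be sketched in a few lines: subsampling a video to 4 fps, and converting a propagated binary mask into a tight bounding box for detector training. Both functions are illustrative assumptions about the workflow, not SAM 2's actual API.

```python
def sampled_frame_indices(total_frames, video_fps, target_fps=4):
    """Indices of the frames kept when subsampling a video to `target_fps`."""
    step = video_fps / target_fps          # e.g. 30 fps -> keep every 7.5th frame
    return [int(i * step) for i in range(int(total_frames / step))]

def mask_to_box(mask):
    """Tight (x1, y1, x2, y2) box around a binary mask given as a list of rows.

    Returns None for an empty mask (e.g. the object left the frame).
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)
```

In the pipeline described above, only the first frame's boxes come from a human; every later frame's boxes would be derived from the propagated masks via something like `mask_to_box`.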
The authors compile a comprehensive dataset covering 19 baked‑good classes (18 specific products plus a fallback “Backware” class). The dataset includes:
- D (train/test): 763 deployment images captured in the bakery with full bounding‑box annotations (610 train, 153 test).
- C train: 315 single‑class images (average 7 items per image) used for weak supervision.
- V train/test: 4,945 training frames and 1,186 test frames from 209 videos, annotated via SAM 2 propagation.
Experiments compare three models: (i) fully supervised baseline trained on D train, (ii) weakly supervised model trained on C train using OWLv2/Grounding DINO boxes, and (iii) the fine‑tuned model incorporating V train pseudo‑labels. The weakly supervised model already matches the baseline on the ideal test set, while the fine‑tuned model outperforms the baseline on the challenging viewpoint test set.
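The comparisons above rest on standard mAP-style evaluation. As a minimal sketch of the matching step underlying such metrics (greedy assignment of confidence-ranked predictions to ground-truth boxes at an IoU threshold), assuming the authors follow the usual protocol; this is not their evaluation code:

```python
def match_detections(preds, gts, iou_thresh=0.5):
    """Count TP/FP/FN: each prediction, in descending confidence order,
    may claim at most one unmatched ground-truth box with IoU >= threshold.

    preds: list of ((x1, y1, x2, y2), confidence); gts: list of boxes.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union else 0.0

    matched, tp = set(), 0
    for box, _score in sorted(preds, key=lambda p: -p[1]):
        # Pick the unmatched ground-truth box with the highest qualifying IoU.
        best, best_iou = None, iou_thresh
        for j, g in enumerate(gts):
            if j not in matched and iou(box, g) >= best_iou:
                best, best_iou = j, iou(box, g)
        if best is not None:
            matched.add(best)
            tp += 1
    return tp, len(preds) - tp, len(gts) - tp  # TP, FP, FN
```

Sweeping the confidence threshold over such TP/FP counts and averaging precision over recall levels and classes gives the mAP figures reported in the paper.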
Key contributions are:
- A scalable, cost‑effective workflow that combines open‑vocabulary detectors with image‑level supervision to generate high‑quality bounding boxes without manual drawing.
- An efficient pseudo‑label propagation pipeline using SAM 2 that dramatically reduces annotation effort while improving robustness to viewpoint changes.
- A thorough evaluation on a real‑world, application‑relevant dataset, demonstrating that limited‑supervision training can achieve industrial‑grade performance.
The final system is integrated into an iOS app for on‑device inference, enabling bakery staff to capture overhead images of leftover products with a simple rig and obtain immediate counts. The study illustrates how modern foundation models (open‑vocabulary detectors, SAM 2) can be repurposed as annotation tools, turning the lack of labeled data from a barrier into an opportunity for rapid, domain‑specific deployment across various industries where bespoke visual tasks and scarce annotations are the norm.