Point2RBox-v3: Self-Bootstrapping from Point Annotations via Integrated Pseudo-Label Refinement and Utilization


Driven by the growing need for Oriented Object Detection (OOD), learning from point annotations under a weakly-supervised framework has emerged as a promising alternative to costly and laborious manual labeling. In this paper, we discuss two deficiencies in existing point-supervised methods: inefficient utilization and poor quality of pseudo labels. Therefore, we present Point2RBox-v3. At the core are two principles: 1) Progressive Label Assignment (PLA). It dynamically estimates instance sizes in a coarse yet intelligent manner at different stages of the training process, enabling the use of label assignment methods. 2) Prior-Guided Dynamic Mask Loss (PGDM-Loss). It is an enhancement of the Voronoi Watershed Loss from Point2RBox-v2 that overcomes the watershed algorithm's poor performance in sparse scenes and SAM's poor performance in dense scenes. To our knowledge, Point2RBox-v3 is the first model to employ dynamic pseudo labels for label assignment, and it creatively complements the strengths of the SAM model with the watershed algorithm, achieving excellent performance in both sparse and dense scenes. Our solution delivers competitive performance, especially in scenarios with large variations in object size or sparse object occurrences: 66.09%/56.86%/41.28%/46.40%/19.60%/45.96% on DOTA-v1.0/DOTA-v1.5/DOTA-v2.0/DIOR/STAR/RSAR.


💡 Research Summary

Point2RBox‑v3 addresses two critical shortcomings of existing point‑supervised oriented object detection (OOD) methods: inefficient utilization of pseudo‑labels and poor pseudo‑label quality. The authors introduce two novel components that together enable a fully end‑to‑end weakly‑supervised detector with performance comparable to fully‑supervised models.

Progressive Label Assignment (PLA) is the first mechanism that brings multi‑scale label assignment—standard in fully‑supervised FPN‑based detectors—into the point‑supervised regime. At the beginning of training, coarse object extents are derived from a Voronoi Watershed segmentation of the image using only the annotated points. The resulting minimum‑area rectangles serve as initial pseudo‑labels, providing rough width, height, and orientation estimates. These pseudo‑labels are then used to assign each point to an appropriate FPN level based on its estimated scale. As training proceeds, the detector’s own predictions are harvested: for each point, the predicted box from the anchor closest to the point on each FPN level is collected, and the box with the highest classification confidence is selected as the updated pseudo‑label. This dynamic refinement supplies progressively more accurate scale information, allowing the network to exploit the hierarchical feature maps of the FPN rather than collapsing all points onto a single level as in prior work.
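The selection-and-assignment step above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the function names, the FCOS-style log2 scale mapping, and the `base_size` value are assumptions.

```python
import math

def update_pseudo_label(candidates):
    """candidates: list of (box, score) pairs, one per FPN level, where box is
    (cx, cy, w, h, angle) predicted by the anchor closest to the annotated
    point on that level. The highest-confidence box becomes the new
    pseudo-label."""
    box, _ = max(candidates, key=lambda c: c[1])
    return box

def assign_fpn_level(box, base_size=32.0, num_levels=5):
    """Map the pseudo-label's estimated scale to an FPN level (hypothetical
    FCOS-style rule): the level index grows with log2 of the object size."""
    _, _, w, h, _ = box
    scale = math.sqrt(w * h)
    level = int(math.log2(max(scale / base_size, 1.0)))
    return min(level, num_levels - 1)

# Example: three candidate boxes harvested from three FPN levels for one point.
cands = [((100, 100, 20, 10, 0.1), 0.42),
         ((101, 99, 24, 12, 0.0), 0.77),
         ((98, 102, 60, 30, 0.2), 0.31)]
best = update_pseudo_label(cands)   # picks the 0.77-confidence box
print(best, assign_fpn_level(best))
```

The key point is the feedback loop: the assigned level determines which features supervise the point, and the resulting predictions refine the scale estimate for the next assignment round.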

Prior‑Guided Dynamic Mask Loss (PGDM‑Loss) tackles the complementary weaknesses of Watershed and Segment‑Anything Model (SAM) masks. Watershed yields reliable masks in dense scenes but suffers from under‑segmentation in sparse scenes where spatial cues are scarce. Conversely, SAM (here implemented as the lightweight MobileSAM) produces high‑quality masks for sparse scenes but is computationally heavy and prone to over‑segmentation when many objects are close together. PGDM‑Loss routes each training image to one of two mask‑generation branches based on a simple instance‑count threshold N_thr. Images with ≤ N_thr instances are processed by the SAM branch; denser images use the original Watershed branch. For SAM‑generated candidates, a prior‑guided scoring function evaluates five metrics—center alignment, color consistency, rectangularity, circularity, and aspect‑ratio reliability—weighted by class‑specific priors. The highest‑scoring mask is selected, rotated to align with the current detection prediction, and then supervised with a Gaussian Wasserstein Distance loss on the width‑height regression. This hybrid loss provides accurate mask supervision without adding inference overhead, as SAM is used only during training.
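The routing rule and the prior-guided scoring can be sketched as follows. The threshold value, the weights, and the metric values are illustrative assumptions; only the overall structure (count-based branch selection, weighted five-cue scoring) comes from the description above.

```python
N_THR = 8  # hypothetical instance-count threshold

def choose_branch(num_instances, n_thr=N_THR):
    # Sparse images (few instances) go to the SAM branch; denser images
    # keep the original Watershed branch.
    return "sam" if num_instances <= n_thr else "watershed"

def score_candidate(metrics, weights):
    # Weighted sum over the five cues named above: center alignment, color
    # consistency, rectangularity, circularity, aspect-ratio reliability.
    # The weights stand in for the class-specific priors.
    return sum(weights[k] * metrics[k] for k in weights)

def select_mask(candidates, weights):
    # Keep the highest-scoring SAM mask candidate.
    return max(candidates, key=lambda c: score_candidate(c["metrics"], weights))

# Toy example with made-up scores for two SAM candidates of one instance.
weights = {"center": 0.3, "color": 0.2, "rect": 0.2, "circ": 0.1, "aspect": 0.2}
candidates = [
    {"id": 0, "metrics": {"center": 0.9, "color": 0.5, "rect": 0.4,
                          "circ": 0.3, "aspect": 0.6}},
    {"id": 1, "metrics": {"center": 0.8, "color": 0.9, "rect": 0.9,
                          "circ": 0.2, "aspect": 0.9}},
]
print(choose_branch(5), select_mask(candidates, weights)["id"])
```

Because the branch decision is per-image and SAM runs only at training time, this routing adds no cost at inference.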

The overall architecture retains the strong backbone and head of Point2RBox‑v2 (ResNet‑50 + FPN + PSC angle coder) and augments it with PLA and PGDM‑Loss while preserving the original auxiliary losses (classification, box regression, angle loss, etc.). The training pipeline first generates static Watershed masks, then progressively replaces them with dynamic network‑predicted boxes and SAM masks where appropriate.
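The Gaussian Wasserstein Distance supervision mentioned above can be sketched with the standard GWD formulation, in which a rotated box (cx, cy, w, h, θ) is modeled as a 2D Gaussian with mean (cx, cy) and covariance R·diag(w²/4, h²/4)·Rᵀ. This is a generic numpy sketch of that distance; the paper's exact loss normalization and scaling are not reproduced here.

```python
import numpy as np

def box_to_gaussian(box):
    cx, cy, w, h, theta = box
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.diag([w * w / 4.0, h * h / 4.0])  # half-extents squared
    return np.array([cx, cy]), R @ S @ R.T

def sqrtm_2x2_sym(M):
    # Matrix square root of a symmetric PSD 2x2 matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gwd(box1, box2):
    # Squared 2-Wasserstein distance between the two Gaussians:
    # ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1^1/2 S2 S1^1/2)^1/2)
    mu1, S1 = box_to_gaussian(box1)
    mu2, S2 = box_to_gaussian(box2)
    r1 = sqrtm_2x2_sym(S1)
    cross = sqrtm_2x2_sym(r1 @ S2 @ r1)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2.0 * cross))

# Identical boxes have (near-)zero distance; the distance grows smoothly with
# misalignment in position, size, or angle, which makes it usable as a loss.
print(gwd((0, 0, 4, 2, 0.0), (0, 0, 4, 2, 0.0)))        # ~0
print(gwd((0, 0, 4, 2, 0.0), (1, 0, 4, 2, np.pi / 6)))  # > 0
```

Modeling boxes as Gaussians avoids the angle-periodicity and boundary-discontinuity issues of direct (w, h, θ) regression, which is why GWD-style losses are a common choice for rotated-box supervision.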

Extensive experiments on six benchmark datasets—DOTA‑v1.0, DOTA‑v1.5, DOTA‑v2.0, DIOR, STAR, and RSAR—demonstrate consistent gains. Point2RBox‑v3 achieves 66.09 % mAP on DOTA‑v1.0, 56.86 % on DOTA‑v1.5, 41.28 % on DOTA‑v2.0, 46.40 % on DIOR, 19.60 % on STAR, and 45.96 % on RSAR, outperforming the previous state‑of‑the‑art Point2RBox‑v2 (average 59.6 % mAP) by 6–7 percentage points. Ablation studies confirm that PLA alone contributes roughly 2–3 percentage points, PGDM‑Loss alone adds 3–4 points, and their combination yields the full boost. Training time increases modestly (≈12 % extra), while inference speed remains unchanged because SAM is omitted at test time.

The paper’s contributions are threefold: (1) introducing dynamic pseudo‑label generation and multi‑scale assignment to weakly‑supervised OOD, (2) designing a density‑aware mask loss that intelligently fuses Watershed and SAM, and (3) demonstrating that these innovations close much of the performance gap between point‑supervised and fully‑supervised detectors without incurring significant computational cost. Limitations include reliance on a fixed instance‑count threshold and the use of a lightweight SAM variant, which may not capture the full potential of more powerful segmentation models in extremely cluttered scenes. Future work could explore adaptive thresholding, richer priors for mask selection, and integration with video‑level temporal consistency.

In summary, Point2RBox‑v3 represents a significant step forward for cost‑effective oriented object detection, proving that with clever pseudo‑label refinement and dynamic supervision, point annotations alone can drive high‑accuracy, scale‑aware detectors suitable for diverse aerial and remote‑sensing applications.

