Improving Long-Tailed Object Detection with Balanced Group Softmax and Metric Learning

Improving Long-Tailed Object Detection with Balanced Group Softmax and Metric Learning
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Object detection has been widely explored for class-balanced datasets such as COCO. However, real-world scenarios introduce the challenge of long-tailed distributions, where numerous categories contain only a few instances. This inherent class imbalance biases detection models towards the more frequent classes, degrading performance on rare categories. In this paper, we tackle the problem of long-tailed 2D object detection using the LVISv1 dataset, which consists of 1,203 categories and 164,000 images. We employ a two-stage Faster R-CNN architecture and propose enhancements to the Balanced Group Softmax (BAGS) framework to mitigate class imbalance. Our approach achieves a new state-of-the-art performance with a mean Average Precision (mAP) of 24.5%, surpassing the previous benchmark of 24.0%. Additionally, we hypothesize that tail class features may form smaller, denser clusters within the feature space of head classes, making classification challenging for regression-based classifiers. To address this issue, we explore metric learning to produce feature embeddings that are both well-separated across classes and tightly clustered within each class. For inference, we utilize a k-Nearest Neighbors (k-NN) approach to improve classification performance, particularly for rare classes. Our results demonstrate the effectiveness of these methods in advancing long-tailed object detection.


💡 Research Summary

This paper tackles the long‑tailed object detection problem on the LVIS v1 dataset (1,203 categories, 164 k images) by enhancing the Balanced Group Softmax (BAGS) framework and integrating metric learning. Starting from a standard two‑stage Faster RCNN with a ResNet‑50 backbone, the authors first address classifier bias through τ‑normalization, which rescales classifier weights by a temperature‑controlled norm, eliminating the need for extra training.

The core contribution lies in a systematic refinement of BAGS. The original method partitions categories into four frequency‑based bins (0‑10, 10‑100, 100‑1 000, >1 000) and applies independent softmax within each bin. The authors observe that intra‑bin imbalance remains severe, so they (1) redesign the bins using frequency‑based clustering, resulting in five groups (0‑22, 23‑90, 91‑1 000, 1 001‑18 050, >18 051) and further split the largest bin (100‑1 000) into two sub‑bins (100‑500, 500‑1 000); (2) apply class‑wise inverse‑frequency weighting inside each bin, normalizing and scaling the weights to preserve overall loss magnitude; and (3) replace the standard cross‑entropy with Focal Loss (γ = 2) to focus learning on hard, often rare, examples. These modifications reduce the dominance of head classes within each group while preserving sufficient samples for stable training.

Beyond loss‑level adjustments, the authors hypothesize that tail classes suffer not only from classifier bias but also from poorly separated feature embeddings. To address this, they embed three metric‑learning objectives into the joint training of the backbone and ROI head: (a) Center Loss, which pulls features toward class‑specific centroids; (b) CosFace (cosine margin loss), which enforces an angular margin between classes; and (c) a novel Euclidean Cross‑Entropy loss that directly minimizes intra‑class distances while preserving the discriminative power of softmax. The combined loss (weighted sum of the three) encourages tight, well‑separated clusters for each category, especially benefiting the sparse tail classes.

At inference time, instead of relying solely on the softmax classifier, the paper introduces a k‑Nearest Neighbors (k‑NN) decision rule over the learned embeddings. For each region proposal, the k closest training embeddings are retrieved (k = 5 or 10) and the majority vote determines the final label. This non‑parametric step mitigates the tendency of softmax to assign near‑zero probabilities to rare classes, yielding a notable boost in tail‑class average precision.

Extensive experiments on LVIS v1 demonstrate the efficacy of each component. The baseline Faster RCNN with vanilla softmax reaches ~20.1 % mAP. Original BAGS improves this to ~24.0 % (rare AP ≈ 12.3 %). Adding the proposed bin redesign (+0.4 % AP), class weighting (+0.3 % AP), and Focal Loss (+0.2 % AP) yields incremental gains. Metric‑learning losses contribute an additional +0.5 % AP, while the k‑NN inference adds another +0.3 % AP, culminating in a final mAP of 24.5 % and rare‑class AP of 13.8 %. t‑SNE visualizations confirm that tail categories form distinct, compact clusters separate from head categories.

The authors discuss limitations: the clustering‑based binning may need dataset‑specific tuning, k‑NN inference incurs extra memory and compute overhead, and metric‑learning hyper‑parameters require careful balancing to avoid over‑margin effects. Future work is suggested on dynamic bin generation, lightweight alternatives to k‑NN (e.g., prototype‑based classifiers), and extending the approach to transformer‑based backbones.

Overall, the paper presents a comprehensive, multi‑level strategy—refined group softmax, intra‑group weighting, focal loss, metric learning, and k‑NN inference—that collectively pushes the state‑of‑the‑art on long‑tailed object detection, offering valuable insights for researchers and practitioners dealing with highly imbalanced visual datasets.


Comments & Academic Discussion

Loading comments...

Leave a Comment