Multi-Task Learning with Multi-Annotation Triplet Loss for Improved Object Detection
Triplet loss traditionally relies only on class labels and does not use all available information in multi-task scenarios where multiple types of annotations are available. This paper introduces a Multi-Annotation Triplet Loss (MATL) framework that extends triplet loss by incorporating additional annotations, such as bounding box information, alongside class labels in the loss formulation. By using these complementary annotations, MATL improves multi-task learning for tasks requiring both classification and localization. Experiments on an aerial wildlife imagery dataset demonstrate that MATL outperforms conventional triplet loss in both classification and localization. These findings highlight the benefit of using all available annotations for triplet loss in multi-task learning frameworks.
💡 Research Summary
The paper addresses a fundamental limitation of conventional triplet loss, which relies solely on class labels, by proposing a Multi‑Annotation Triplet Loss (MATL) that also incorporates bounding‑box information. The authors first derive two geometric features from each object's bounding box: area (width × height) and symmetric squareness (1 − min(width/height, height/width), so a perfect square scores 0). After min‑max normalization, four features (area, squareness, width, height) are clustered with K‑means, with K = 3 chosen via the elbow method, yielding three "box labels" that capture typical size‑shape patterns (large‑elongated, small‑elongated, small‑square).
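The box-label construction described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the example boxes are made up, and the small hand-rolled K-means stands in for whatever clustering implementation the paper used.

```python
import numpy as np

def box_features(boxes):
    """Four geometric features from (width, height) pairs, min-max normalized."""
    w, h = boxes[:, 0], boxes[:, 1]
    area = w * h
    # symmetric squareness: 0 for a perfect square, approaching 1 when elongated
    squareness = 1.0 - np.minimum(w / h, h / w)
    feats = np.stack([area, squareness, w, h], axis=1)
    lo, hi = feats.min(axis=0), feats.max(axis=0)
    return (feats - lo) / (hi - lo + 1e-12)   # min-max normalize to [0, 1]

def kmeans_labels(x, k=3, iters=50, seed=0):
    """Plain NumPy K-means (stand-in for a library call); returns cluster ids."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):           # skip empty clusters
                centers[j] = x[labels == j].mean(axis=0)
    return labels

# Hypothetical (width, height) boxes; each object gets one of three box labels.
boxes = np.array([[120.0, 40.0], [30.0, 28.0], [25.0, 90.0], [32.0, 30.0]])
labels = kmeans_labels(box_features(boxes), k=3)
```

In the paper, K = 3 is selected with the elbow method over candidate K values; here it is simply fixed.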
MATL combines two triplet losses: one based on traditional class labels (L_class) and one based on the newly created box labels (L_box). The overall loss is a weighted sum L_MATL = (1 − λ)L_class + λL_box, where λ controls the contribution of the box‑based term. Experiments on the Animal Wildlife Image Repository (AWIR) – a collection of aerial RGB images of deer, horses, and cows – use 300 × 300 pixel tiles containing a single animal each. A stratified 8‑fold cross‑validation (30 % training, 70 % test) is repeated eight times to obtain mean and standard deviation.
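The weighted combination above can be written down directly. A minimal sketch, assuming standard margin-based triplet losses whose triplets are mined separately by class label and by box label (the mining details are not spelled out here):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss on embedding vectors."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

def matl_loss(class_triplet, box_triplet, lam=0.25, margin=1.0):
    """L_MATL = (1 - lambda) * L_class + lambda * L_box."""
    l_class = triplet_loss(*class_triplet, margin)
    l_box = triplet_loss(*box_triplet, margin)
    return (1.0 - lam) * l_class + lam * l_box

# Toy 2-D embeddings: class triplet is already satisfied, box triplet is not.
a1, p1, n1 = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([3.0, 0.0])
a2, p2, n2 = np.array([0.0, 0.0]), np.array([0.0, 2.0]), np.array([0.0, 1.0])
loss = matl_loss((a1, p1, n1), (a2, p2, n2), lam=0.25)
# 0.75 * 0.0 + 0.25 * 2.0 = 0.5
```

With λ = 0 this reduces to ordinary class-label triplet loss; λ = 0.25 is the setting the paper reports as the best trade-off.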
The network architecture consists of a dilated‑convolution encoder (16 → 512 channels) feeding a shared latent space, a decoder that reconstructs a binary mask (used to derive a bounding box), and a separate fully‑connected classifier head. In the multi‑task configuration, both the classification head and the box‑detector head operate on the same latent representation, encouraging the encoder to learn features useful for both tasks.
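The multi-task wiring (one shared latent space, two heads, and a mask that is turned into a bounding box) can be illustrated with a toy dense forward pass. All sizes and weights below are placeholders; the real model uses dilated convolutions on 300 × 300 tiles, not a 32 × 32 linear projection.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_LATENT, N_CLASSES = 32 * 32, 64, 3          # toy sizes, not the paper's

W_enc = rng.standard_normal((D_IN, D_LATENT)) * 0.05    # shared encoder
W_cls = rng.standard_normal((D_LATENT, N_CLASSES)) * 0.05  # classifier head
W_dec = rng.standard_normal((D_LATENT, D_IN)) * 0.05    # mask-decoder head

def forward(img):
    """One shared latent vector feeds both the classifier and the decoder."""
    z = np.maximum(W_enc.T @ img.ravel(), 0.0)          # encoder + ReLU
    logits = W_cls.T @ z                                # class scores
    mask = 1.0 / (1.0 + np.exp(-(W_dec.T @ z)))         # binary-mask probabilities
    return z, logits, mask

def mask_to_box(mask2d, thresh=0.5):
    """Derive an axis-aligned bounding box from a thresholded mask."""
    ys, xs = np.nonzero(mask2d > thresh)
    if len(xs) == 0:
        return None
    return xs.min(), ys.min(), xs.max(), ys.max()       # (x0, y0, x1, y1)

z, logits, mask = forward(rng.random((32, 32)))
box = mask_to_box(mask.reshape(32, 32))
```

The point of the sketch is the data flow: both heads read the same latent `z`, so gradients from classification and localization shape one shared representation.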
Results show that MATL consistently outperforms both a baseline without any triplet loss (WTL) and a model using only class‑label triplet loss (CL TL). In the single‑task setting at λ = 0.25, classification accuracy improves from 58.2 % (WTL) to 83.0 % (MATL), while IoU stays essentially flat (0.190 vs. 0.185). In the multi‑task setting, accuracy climbs from 67.8 % (WTL) to 83.2 % (MATL) and IoU holds steady (0.178 vs. 0.179). The authors find that λ = 0.25 offers the best trade‑off: higher λ values increase the influence of box information but slightly degrade classification performance.
PCA visualizations of the latent space reveal that the class‑only loss creates well‑separated class clusters but ignores intra‑class variation. Adding box labels produces sub‑clusters within each class, reflecting differences in object size and shape, which are crucial for precise localization. This demonstrates that MATL learns a richer embedding that simultaneously respects inter‑class separability and intra‑class spatial diversity.
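A projection like the one the authors visualize can be reproduced in spirit with plain SVD-based PCA. The embeddings below are synthetic stand-ins (two classes, each split into two box-label sub-clusters), not the paper's latent vectors:

```python
import numpy as np

def pca_2d(embeddings):
    """Project embeddings onto their top-2 principal components via SVD."""
    x = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)  # rows of vt: PC directions
    return x @ vt[:2].T

# Synthetic 16-D embeddings: 4 tight clusters = 2 classes x 2 box-label groups.
rng = np.random.default_rng(2)
centers = rng.standard_normal((4, 16)) * 5.0
emb = np.concatenate([c + 0.3 * rng.standard_normal((50, 16)) for c in centers])
coords = pca_2d(emb)    # (200, 2) points, ready to scatter-plot
```

In a plot of `coords`, the class-only loss would collapse each class's sub-clusters together, while MATL keeps them visibly separated within each class region.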
The paper acknowledges limitations: the box labels depend on K‑means clustering, so the choice of K and initialization may affect results; only RGB imagery is used, leaving multimodal extensions (e.g., thermal or LiDAR data) for future work. Nonetheless, the study convincingly shows that integrating multiple annotation types into triplet loss yields a more structured latent space and measurable gains in both classification and localization for object detection tasks.