Enhancing Features in Long-tailed Data Using Large Vision Model
Language-based foundation models, such as large language models (LLMs) and large vision-language models (LVLMs), have been widely studied for long-tailed recognition. However, the linguistic data these models require is not available in all practical settings. In this study, we explore using large vision models (LVMs), also called visual foundation models (VFMs), to enhance long-tailed data features without any language information. Specifically, we extract features from the LVM and fuse them with the baseline network's features in both the feature-map and latent spaces to obtain augmented features. Moreover, we design several prototype-based losses in the latent space to further exploit the potential of the augmented features. In the experimental section, we validate our approach on two benchmark datasets: ImageNet-LT and iNaturalist2018.
💡 Research Summary
The paper addresses the long‑tailed recognition problem without relying on any linguistic information, by leveraging a large vision model (LVM) – specifically the Segment Anything Model (SAM) – to augment the feature representations of a conventional convolutional backbone. The authors propose two complementary augmentation pathways. First, the “SAM map feature” pathway extracts a high‑dimensional feature map from SAM, reduces its channel dimension to one via Principal Component Analysis (PCA), and then expands it back to match the backbone’s channel size using a 1×1 convolution. This transformed map is fused with the backbone’s intermediate feature map through element‑wise multiplication followed by addition, thereby injecting spatially‑aware cues derived from SAM into the backbone’s representation. Second, the “SAM latent feature” pathway applies average pooling to the SAM feature map to obtain a compact latent vector. This vector is concatenated with the backbone’s final feature vector, passed through a self‑attention module and a fully‑connected layer to produce a SAM‑derived logit. The final classification logit is a weighted sum of the SAM logit and the original backbone logit, controlled by a hyper‑parameter α.
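The two pathways above can be sketched in NumPy. This is a minimal, illustrative reconstruction, not the authors' implementation: the array shapes, the single-token "self-attention" stand-in, the weight matrices, and the exact form of the α-weighted logit combination are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_reduce_to_one(sam_map):
    """Project a (C, H, W) SAM feature map onto its leading principal
    component across channels, yielding a (1, H, W) map (PCA step)."""
    C, H, W = sam_map.shape
    X = sam_map.reshape(C, -1)                   # channels x pixels
    Xc = X - X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    pc1 = U[:, 0]                                # first principal direction
    return (pc1 @ X).reshape(1, H, W)

def conv1x1(x, weight):
    """A 1x1 convolution is a per-pixel linear map (C_in -> C_out)."""
    C_in, H, W = x.shape
    return (weight @ x.reshape(C_in, -1)).reshape(weight.shape[0], H, W)

def fuse_map(backbone_map, sam_map, weight):
    """Map-level fusion: element-wise multiplication followed by addition,
    i.e. F * S + F, where S is the SAM map expanded to backbone channels."""
    sam_1ch = pca_reduce_to_one(sam_map)
    sam_exp = conv1x1(sam_1ch, weight)           # 1 -> C_backbone channels
    return backbone_map * sam_exp + backbone_map

def sam_latent_logit(backbone_vec, sam_map, W_q, W_k, W_v, W_fc,
                     alpha, backbone_logit):
    """Latent pathway: average-pool the SAM map, concatenate with the
    backbone vector, apply a toy one-token attention step and an FC layer,
    then combine logits with weight alpha (assumed convex combination)."""
    sam_vec = sam_map.mean(axis=(1, 2))          # global average pooling
    z = np.concatenate([backbone_vec, sam_vec])
    q, k, v = W_q @ z, W_k @ z, W_v @ z          # single-token attention
    attn = v * (q @ k / np.sqrt(len(q)))         # scalar score scales value
    sam_logit = W_fc @ attn
    return alpha * sam_logit + (1 - alpha) * backbone_logit
```

With a 32-channel SAM map and a 16-channel backbone map, `fuse_map` keeps the backbone shape unchanged, so it can be dropped between backbone stages without altering downstream layers.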
To further exploit the enriched representations, the authors design a prototype‑based loss suite. A memory bank per class stores the most recent augmented features; the class prototype is computed as the average of the bank’s entries. The loss comprises three components: (1) a center loss that pulls all features toward their respective prototypes, (2) a head‑class loss (L_head) that applies only to classes with low imbalance weights (w_y < τ_head) and encourages compact clustering of head‑class features around their prototypes, and (3) a tail‑class loss consisting of L_tail‑std (which minimizes distance to the prototype for tail classes) and L_tail‑logdist (which pushes tail features away from the prototype in a logarithmic fashion to promote diversity). This design balances the need for tight clusters in head classes with the need for diverse, well‑spread representations in tail classes.
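The memory-bank bookkeeping and the head/tail split can be sketched as follows. The summary does not give the exact loss formulas, so the squared-distance and negative-log-distance terms below are plausible stand-ins, and names such as `tau_head` and `bank_size` mirror the hyper-parameters mentioned above rather than the paper's actual code.

```python
import numpy as np
from collections import deque

class PrototypeLosses:
    """Illustrative sketch of the prototype-based loss suite: a per-class
    memory bank of recent features, prototypes as bank averages, and
    separate head-class and tail-class terms."""

    def __init__(self, num_classes, bank_size=64, tau_head=0.5, eps=1e-6):
        self.banks = [deque(maxlen=bank_size) for _ in range(num_classes)]
        self.tau_head = tau_head
        self.eps = eps

    def update(self, feats, labels):
        # Store the most recent augmented features per class.
        for f, y in zip(feats, labels):
            self.banks[y].append(f)

    def prototype(self, y):
        return np.mean(self.banks[y], axis=0)

    def losses(self, feats, labels, class_weights):
        center = head = tail_std = tail_logdist = 0.0
        for f, y in zip(feats, labels):
            d = np.linalg.norm(f - self.prototype(y))
            center += d ** 2                        # pull toward prototype
            if class_weights[y] < self.tau_head:    # head class: compactness
                head += d ** 2
            else:                                   # tail class
                tail_std += d ** 2                  # stay near prototype...
                tail_logdist += -np.log(d + self.eps)  # ...without collapsing
        n = len(feats)
        return {k: v / n for k, v in
                {"center": center, "head": head,
                 "tail_std": tail_std, "tail_logdist": tail_logdist}.items()}
```

Note the tension the design resolves: `tail_std` and `tail_logdist` pull in opposite directions, so tail features settle at a moderate distance from their prototype, which is how the method trades cluster tightness for representation diversity.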
Experiments are conducted on two widely used long‑tailed benchmarks: ImageNet‑LT and iNaturalist2018. The proposed SAM‑augmented models outperform a range of baselines, including data resampling, re‑weighting, contrastive learning, and recent language‑model‑guided methods, by 1–2 percentage points in top‑1 accuracy. Notably, the gains are more pronounced for tail classes (those with ≤100 training samples), where the prototype‑based loss yields an additional ~0.5 % improvement. Ablation studies confirm that both the map‑level fusion and the latent‑vector fusion contribute positively, and that the prototype loss further refines performance.
The paper acknowledges several limitations. SAM is used as a frozen feature extractor, so potential gains from fine‑tuning SAM are unexplored. The PCA and 1×1 convolution steps introduce extra computational overhead, which may be non‑trivial for very large datasets. Hyper‑parameters such as the memory‑bank size (M), the fusion weight α, and the head‑class threshold τ_head are not extensively analyzed for sensitivity, suggesting that practical deployment would require careful tuning. Future work could investigate fine‑tuning SAM, replacing SAM with other large vision transformers, or developing more memory‑efficient prototype update mechanisms.
In summary, this work demonstrates that large vision models can be harnessed, without any textual modality, to enrich feature representations for long‑tailed classification. The combination of spatial‑level feature fusion and a carefully crafted prototype‑based loss provides a viable recipe for improving both head and tail class performance, offering a promising direction for vision‑only approaches to imbalanced learning.