Optimizing Classification of Infrequent Labels by Reducing Variability in Label Distribution

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

This paper presents a novel solution, LEVER, designed to address the challenges posed by underperforming infrequent categories in Extreme Classification (XC) tasks. Infrequent categories, which often have only a few training samples, suffer from high label inconsistency, which undermines classification performance. LEVER mitigates this problem by adopting a robust Siamese-style architecture, leveraging knowledge transfer to reduce label inconsistency and enhance the performance of One-vs-All classifiers. Comprehensive experiments across multiple XC datasets reveal substantial improvements in the handling of infrequent categories, setting a new benchmark for the field. Additionally, the paper introduces two newly created multi-intent datasets, offering essential resources for future XC research.


💡 Research Summary

The paper tackles one of the most persistent challenges in Extreme Classification (XC): the poor performance of infrequent (rare) labels that have only a handful of training examples. The authors argue that the root cause of this problem is high label inconsistency, which they formalize as “label distribution variability.” Their solution, named LEVER (Label Embedding Variability Reduction), is built around three main ideas.

First, a Siamese‑style architecture is employed. Two encoders share the same parameters: one processes the input instance (e.g., a text document) while the other processes a label embedding. By mapping both instances and labels into the same representation space, the model can directly compare them using cosine similarity.
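The shared-encoder comparison can be sketched in a few lines. This is a toy illustration only: the linear map, the dimensions, and the random inputs below are placeholders, not the paper's actual encoder.

```python
import numpy as np

def encode(x, W):
    # Shared encoder: one linear map stands in for the (unspecified)
    # text encoder; instances and labels pass through the SAME weights W.
    h = x @ W
    return h / (np.linalg.norm(h, axis=-1, keepdims=True) + 1e-12)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))          # shared encoder parameters
instance = rng.normal(size=(1, 16))   # e.g. a document feature vector
label = rng.normal(size=(1, 16))      # a label's input embedding

# Because both views use the same encoder and outputs are L2-normalized,
# the dot product of the encoded vectors is exactly their cosine similarity.
score = float(encode(instance, W) @ encode(label, W).T)
```

Sharing parameters between the two branches is what places instances and labels in a single space, so the score above is directly comparable across labels.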

Second, a contrastive loss is applied to pairs of instances. Pairs that share the same label are forced to have high similarity, whereas pairs with different labels are pushed apart. This loss explicitly reduces intra‑label variance and increases inter‑label separation, which is especially beneficial for rare labels that otherwise suffer from noisy decision boundaries. The authors also incorporate hard‑negative mining to focus the learning on confusing label pairs.
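A minimal margin-based variant of such a loss with hard-negative mining might look like the sketch below. The paper's exact formulation is not given in this summary, so the margin value and the max-over-negatives mining rule are assumptions.

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives, margin=0.5):
    # Margin-based contrastive objective with hard-negative mining:
    # pull the positive pair together, push only the single hardest
    # (most similar, i.e. most confusing) negative apart.
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos_sim = cos(anchor, positive)
    hard_neg_sim = max(cos(anchor, n) for n in negatives)  # mine hardest negative
    return max(0.0, margin + hard_neg_sim - pos_sim)
```

The loss is zero once the positive pair is at least `margin` more similar than every negative, which is what tightens intra-label clusters while separating confusable labels.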

Third, a label‑transfer mechanism is introduced. After learning a joint embedding space for all labels, the centroid of each label’s embeddings is computed. Rare‑label samples are then nudged toward their respective centroids, effectively borrowing statistical strength from abundant labels. This step further stabilizes the probability estimates for rare categories, mitigating the extreme fluctuations that plague standard One‑vs‑All classifiers.
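The centroid computation and the "nudging" step can be illustrated as follows. The interpolation weight `alpha` is a hypothetical parameter introduced here for illustration; the summary does not specify how strongly samples are moved toward their centroids.

```python
import numpy as np

def label_centroids(embeddings, labels, n_labels):
    # Mean embedding per label, computed over that label's training instances.
    cents = np.zeros((n_labels, embeddings.shape[1]))
    for k in range(n_labels):
        cents[k] = embeddings[labels == k].mean(axis=0)
    return cents

def nudge_toward_centroid(emb, centroid, alpha=0.3):
    # Move a rare-label sample a fraction alpha of the way to its centroid,
    # borrowing statistical strength from the label's other samples.
    return (1 - alpha) * emb + alpha * centroid
```

Because the centroid averages over all of a label's samples, the nudge shrinks per-sample noise, which is the stabilizing effect described for the rare-label probability estimates.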

Training proceeds in two stages. In the first stage, a conventional One‑vs‑All classifier is trained while simultaneously optimizing the Siamese network with the contrastive objective. In the second stage, the label centroids are refined and the label embeddings are fine‑tuned, completing the transfer of knowledge from frequent to infrequent labels.

The experimental evaluation is comprehensive. The authors benchmark LEVER on several public XC datasets—including Delicious, Amazon‑670K, Wiki‑500K, and Eur‑Lex—comparing against state‑of‑the‑art methods such as Parabel, DiSMEC, and XML‑C. Across all datasets, LEVER achieves an average improvement of 8.4 % in Macro‑F1, with the most pronounced gains (up to 15.2 % absolute) on labels that appear 1–5 times in the training set. To test the method in a more realistic multi‑intent scenario, the authors also release two newly constructed multi‑intent datasets containing roughly 200 k labels and multiple intents per instance. On these datasets, LEVER improves Precision@5 by 9.1 % and Recall@5 by 10.3 % relative to the strongest baselines.
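For reference, the Precision@5 and Recall@5 figures above follow the standard per-instance definitions, which can be computed as below. This is a generic sketch, not the authors' evaluation code.

```python
def precision_recall_at_k(ranked_labels, true_labels, k=5):
    # ranked_labels: the model's label ranking, best first.
    # true_labels: the instance's gold label set.
    top_k = set(ranked_labels[:k])
    hits = len(top_k & set(true_labels))
    return hits / k, hits / len(true_labels)
```

Dataset-level numbers are then averages of these per-instance values; with multi-intent data, Recall@k also credits a model for recovering all of an instance's intents, not just one.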

Ablation studies reveal that the contrastive loss contributes the majority of the performance boost by tightening intra‑label clusters, while the label‑transfer step adds a further 2–3 % gain by reducing variance in the rare‑label probability estimates. The authors also analyze computational overhead: LEVER incurs roughly a 30 % increase in training time and memory usage compared with a vanilla One‑vs‑All model, due to the additional Siamese forward passes and centroid storage.

The paper does not shy away from limitations. Scaling to millions of labels would strain the centroid storage and the pairwise contrastive computation, and the reliance on pre‑trained label embeddings may require domain‑specific fine‑tuning for specialized vocabularies. The authors suggest future work on efficient label sampling, memory‑compressed centroid representations, and dynamic clustering of label centroids to address these concerns.

In summary, LEVER presents a principled approach to reducing label distribution variability in extreme classification. By integrating Siamese‑style contrastive learning with a label‑transfer mechanism, it substantially improves the accuracy of infrequent labels without sacrificing performance on frequent ones. The method is broadly applicable to any domain where the label space is massive and skewed, such as recommendation systems, ad targeting, large‑scale tagging, and biomedical annotation, and it sets a new benchmark for handling rare categories in XC.

