Mine and Refine: Optimizing Graded Relevance in E-commerce Search Retrieval
We propose a two-stage "Mine and Refine" contrastive training framework for semantic text embeddings to enhance multi-category e-commerce search retrieval. Large-scale e-commerce search demands embeddings that generalize to long-tail, noisy queries while relying on scalable supervision that is compatible with product and policy constraints. A practical challenge is that relevance is often graded: users accept substitutes or complements beyond exact matches, and production systems benefit from clear separation of similarity scores across these relevance strata for stable hybrid blending and thresholding. To obtain scalable, policy-consistent supervision, we fine-tune a lightweight LLM on human annotations under a three-level relevance guideline and further reduce residual noise via engagement-driven auditing. In Stage 1, we train a multilingual Siamese two-tower retriever with a label-aware supervised contrastive objective that shapes a robust global semantic space. In Stage 2, we mine hard samples via ANN search, re-annotate them with the policy-aligned LLM, and introduce a multi-class extension of circle loss that explicitly sharpens the similarity boundaries between relevance levels, further refining and enriching the embedding space. Robustness is additionally improved through additive spelling augmentation and synthetic query generation. Extensive offline evaluations and production A/B tests show that our framework improves retrieval relevance and delivers statistically significant gains in engagement and business impact.
💡 Research Summary
The paper tackles a practical problem in large‑scale e‑commerce search: relevance is not binary but graded into three levels—exact match (relevant), substitute or complement (moderately relevant), and irrelevant. To train semantic embeddings that respect this graded relevance while meeting latency, memory, and policy constraints, the authors propose a two‑stage “Mine and Refine” framework.
Labeling pipeline – Human annotators first label a sizable dataset with the three‑level schema. A lightweight LLM (gpt‑4o‑mini) is fine‑tuned on this data, achieving 87.6 % three‑class accuracy and 98.8 % “within‑1” accuracy. To further reduce label noise, 30 % of the data where engagement signals disagree with the LLM label are re‑evaluated by stronger LLMs (GPT‑4o, o1, Gemini‑2.5‑Flash) and validated by experts. This audit corrects 23.4 % of the sampled cases, cutting overall label error by 5.74 %.
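The engagement-driven audit step can be sketched as a simple disagreement filter followed by a majority vote among stronger judges. This is a hypothetical illustration, not the paper's code: the function names, the dict-based labels, and the callable judges are all assumptions.

```python
def audit_labels(items, llm_label, engagement_label, strong_judges):
    """Flag items where the engagement-derived relevance label disagrees
    with the lightweight LLM's label, then resolve the conflict by a
    majority vote among a panel of stronger judges (stand-ins here for
    GPT-4o / o1 / Gemini-2.5-Flash followed by expert validation).

    llm_label / engagement_label: dicts mapping item -> relevance level.
    strong_judges: list of callables, each item -> relevance level.
    """
    corrected = {}
    for item in items:
        if llm_label[item] != engagement_label[item]:
            votes = [judge(item) for judge in strong_judges]
            # majority vote over the panel's re-annotations
            corrected[item] = max(set(votes), key=votes.count)
    return corrected
```

In production the disputed subset would also be spot-checked by human experts, as the paper describes; the sketch only captures the routing logic.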
Model architecture – A Siamese two‑tower encoder initialized from a 0.1 B multilingual text‑embedding model is used. Separate projection heads compress embeddings for low‑latency inference. The multilingual backbone allows the system to handle non‑English queries without extra pipelines.
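The architecture can be sketched as two projection heads on top of a shared backbone. The numpy stub below uses random placeholder weights and hypothetical dimensions (768-d backbone, 128-d serving embedding), purely to show the data flow, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

class TwoTower:
    """Minimal Siamese two-tower sketch: a shared multilingual backbone
    (represented here only by its output vectors) feeds separate query and
    item projection heads that compress embeddings for low-latency serving.
    Weights are random placeholders, not trained parameters."""

    def __init__(self, backbone_dim=768, proj_dim=128):
        self.q_proj = rng.normal(size=(backbone_dim, proj_dim)) / np.sqrt(backbone_dim)
        self.i_proj = rng.normal(size=(backbone_dim, proj_dim)) / np.sqrt(backbone_dim)

    def encode_query(self, h):
        # h: backbone output for query text, shape (N, backbone_dim)
        z = h @ self.q_proj
        return z / np.linalg.norm(z, axis=-1, keepdims=True)

    def encode_item(self, h):
        z = h @ self.i_proj
        return z / np.linalg.norm(z, axis=-1, keepdims=True)

    def score(self, hq, hi):
        # cosine similarity of the L2-normalized projected embeddings
        return np.sum(self.encode_query(hq) * self.encode_item(hi), axis=-1)
```

Because both towers emit normalized fixed-size vectors, the item side can be pre-computed offline and only the query tower runs at request time.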
Stage 1: Global semantic space – The authors train the two‑tower model with a label‑aware supervised contrastive loss (SupCon). SupCon pulls together all pairs sharing the same relevance label and pushes apart pairs with different labels, thereby shaping a robust, globally consistent embedding space that already respects the three relevance strata.
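A minimal numpy sketch of the label-aware SupCon objective, assuming the standard Khosla-et-al. formulation (positives are all in-batch samples sharing the anchor's label); the exact batching and temperature used in the paper are not specified here.

```python
import numpy as np

def supcon_loss(emb, labels, tau=0.1):
    """Supervised contrastive loss over a batch of embeddings.

    emb: (N, D) embeddings (normalized inside); labels: (N,) group ids.
    For each anchor, the positives are the other samples with the same
    label; the loss is the negative mean log-probability of those
    positives under a temperature-scaled softmax over all other samples.
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T / tau
    np.fill_diagonal(sim, -np.inf)          # exclude self-similarity
    log_prob = sim - np.log(np.sum(np.exp(sim), axis=1, keepdims=True))
    same = (labels[:, None] == labels[None, :]) & ~np.eye(len(labels), dtype=bool)
    pos_counts = same.sum(axis=1)
    # average log-probability over each anchor's positives, then negate
    per_anchor = -np.where(same, log_prob, 0.0).sum(axis=1) / np.maximum(pos_counts, 1)
    return float(per_anchor[pos_counts > 0].mean())
```

Minimizing this loss pulls same-label pairs together and pushes different-label pairs apart, which is the global structure Stage 1 aims for.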
Stage 2: Hard‑sample mining and refinement – After Stage 1 converges, an approximate nearest neighbor (ANN) index is built from the current embeddings. The top‑K retrieved items for each query are mined as “hard” samples because they are semantically close yet potentially mislabeled under the graded schema. These mined pairs are re‑annotated by the fine‑tuned LLM, which eliminates false negatives (e.g., a substitute that would otherwise be treated as irrelevant) and identifies hard positives (substitutes/complements). The resulting curriculum consists of hard negatives, hard positives, and standard examples, all correctly labeled.
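The mining and re-annotation loop can be sketched as follows. A brute-force similarity search stands in for the production ANN index (e.g. an HNSW or Faiss index), and `llm_judge` is a hypothetical callable standing in for the fine-tuned relevance LLM.

```python
import numpy as np

def mine_hard_samples(query_emb, item_emb, k=5):
    """Brute-force stand-in for ANN retrieval: return the indices of the
    top-k closest items per query as hard-sample candidates. These pairs
    are semantically close under the current model and therefore the most
    likely to be mislabeled under the graded schema."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    it = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    sim = q @ it.T
    return np.argsort(-sim, axis=1)[:, :k]

def relabel(pairs, llm_judge):
    """Re-annotate mined (query_idx, item_idx) pairs with the policy-aligned
    judge; llm_judge is a hypothetical callable returning 0/1/2 relevance.
    This is where false negatives (substitutes scored as irrelevant) are
    corrected and hard positives are surfaced."""
    return {(q, i): llm_judge(q, i) for q, i in pairs}
```

In the paper's setting, the relabeled pairs then form the Stage 2 curriculum of hard negatives, hard positives, and standard examples.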
Multi‑class circle loss – Traditional circle loss optimizes binary similarity (positive vs. negative). The authors extend it to three classes by defining target similarity margins for each relevance level (m₂ > m₁ > m₀) and weighting factors α, β that control the pull‑push forces. The loss simultaneously encourages high similarity for exact matches, moderate similarity for substitutes/complements, and low similarity for irrelevant items, producing a circular decision boundary that sharply separates the three strata.
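One plausible shape of such a loss is sketched below. The per-level target similarities, the adaptive weight, and the softplus aggregation are assumptions in the spirit of circle loss (deviation-proportional weights), not the paper's exact formulation.

```python
import numpy as np

def multiclass_circle_loss(sim, labels, targets=(0.1, 0.5, 0.9), gamma=16.0):
    """Hedged sketch of a three-class circle-style loss.

    sim: (N,) cosine similarities of query-item pairs.
    labels: (N,) relevance levels (0 = irrelevant, 1 = substitute/
            complement, 2 = exact match).
    targets: per-level target similarities with m2 > m1 > m0, so exact
             matches are pulled high, substitutes to a middle band, and
             irrelevant pairs low.
    """
    sim = np.asarray(sim, dtype=float)
    t = np.asarray(targets)[np.asarray(labels)]   # target for each pair's level
    dev = sim - t
    alpha = np.abs(dev)  # circle-loss-style weight: larger deviation, stronger pull
    # softplus keeps the loss smooth and near-flat once a pair sits on target
    return float(np.mean(np.logaddexp(0.0, gamma * alpha * np.abs(dev))))
```

Because each level is attracted to its own similarity band, the score distributions of the three strata separate, which is what makes downstream thresholding and hybrid blending stable.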
Data augmentation – To improve robustness to noisy, long‑tail queries, the training pipeline adds spelling perturbations and synthetic queries generated from item attributes (name, two‑level taxonomy, description).
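The spelling-perturbation side of the augmentation might look like the sketch below; the specific edit operations (adjacent swap, drop, duplicate) and the `rate` parameter are illustrative assumptions, not the paper's recipe.

```python
import random

def spelling_perturb(text, rate=0.1, seed=None):
    """Hypothetical additive spelling augmentation: randomly swap adjacent
    characters, drop a character, or duplicate one, to mimic the noisy
    long-tail queries seen in production search traffic."""
    rnd = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        r = rnd.random()
        if r < rate and i + 1 < len(chars):      # swap adjacent characters
            out.extend([chars[i + 1], chars[i]])
            i += 2
        elif r < 2 * rate:                       # drop this character
            i += 1
        elif r < 3 * rate:                       # duplicate this character
            out.extend([chars[i], chars[i]])
            i += 1
        else:                                    # keep unchanged
            out.append(chars[i])
            i += 1
    return "".join(out)
```

Pairing a perturbed query with the same labeled items as the clean query teaches the encoder to map misspellings near their intended meaning.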
Results – Offline evaluations show consistent gains: NDCG@10, Recall@100, and Precision@50 improve by 4–7 percentage points, and the similarity score distributions of the relevance levels become markedly more separable. Online A/B testing, where only the embedding retriever is swapped, yields statistically significant lifts: +3.2 % add‑to‑cart rate, +2.8 % conversion rate, and +4.1 % gross order value, all significant at the 95 % confidence level.
Contributions – (1) A scalable, policy‑aligned labeling pipeline that combines human data, lightweight LLM fine‑tuning, and engagement‑driven audit. (2) A two‑stage training design that first learns a global semantic space and then refines it with policy‑consistent hard samples. (3) A novel multi‑class extension of circle loss that directly optimizes graded relevance margins, improving both standard retrieval metrics and the practical separability needed for hybrid blending and thresholding. (4) Demonstrated production‑scale impact with multilingual support and low‑latency inference.
In summary, the “Mine and Refine” approach successfully integrates graded relevance into dense retrieval training, turning hard‑sample mining from a mere efficiency trick into a purposeful mechanism for exposing and correcting policy‑critical ambiguities. The resulting embeddings not only retrieve more relevant items but also produce well‑calibrated similarity scores that simplify downstream ranking and business logic in real‑world e‑commerce search systems.