RexBERT: Context Specialized Bidirectional Encoders for E-commerce


Encoder-only transformers remain indispensable in retrieval, classification, and ranking systems where latency, stability, and cost are paramount. Most general-purpose encoders, however, are trained on generic corpora with limited coverage of specialized domains. We introduce RexBERT, a family of BERT-style encoders designed specifically for e-commerce semantics. We make three contributions. First, we release Ecom-niverse, a 350-billion-token corpus curated from diverse retail and shopping sources. We describe a modular pipeline that isolates and extracts e-commerce content from FineFineWeb and other open web resources, and characterize the resulting domain distribution. Second, we present a reproducible pretraining recipe building on ModernBERT’s architectural advances. The recipe consists of three phases: general pretraining, context extension, and annealed domain specialization. Third, we train RexBERT models ranging from 17M to 400M parameters and evaluate them on token classification, semantic similarity, and general natural language understanding tasks using e-commerce datasets. Despite having 2-3x fewer parameters, RexBERT outperforms larger general-purpose encoders and matches or surpasses modern long-context models on domain-specific benchmarks. Our results demonstrate that high-quality in-domain data combined with a principled training approach provides a stronger foundation for e-commerce applications than indiscriminate scaling alone.


💡 Research Summary

The paper introduces RexBERT, a family of BERT‑style encoder‑only transformers specifically tailored for e‑commerce language. Recognizing that most publicly available encoders are trained on generic web corpora and therefore lack the nuanced understanding required for product‑related tasks, the authors construct a dedicated 350‑billion‑token corpus named Ecom‑niverse. This corpus is derived from the FineFineWeb dataset, a 4.4‑trillion‑token CommonCrawl‑derived collection organized into roughly fifty topical domains. By manually selecting domains with high commercial relevance (e.g., Hobby, News, Fashion, Beauty) and applying a multi‑stage filtering pipeline—LLM‑based binary relevance labeling (using Phi‑4 and validated with Llama 3‑70B), QA feedback loops, fastText distillation for scalable scoring, deduplication, language detection, and profanity filtering—the authors isolate over 350 billion high‑quality English tokens that predominantly discuss retail, product attributes, and consumer behavior.
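To make the curation stages concrete, here is a schematic sketch of such a filtering pipeline. The function names, the profanity lexicon, and the toy relevance scorer are all illustrative stand-ins (the paper's actual pipeline uses a distilled fastText classifier and LLM-generated labels), not the authors' implementation:

```python
import hashlib

PROFANITY = {"badword"}  # placeholder lexicon, not the paper's filter list

def dedup_key(text: str) -> str:
    """Exact-duplicate key via content hashing (real pipelines often use
    fuzzy/MinHash deduplication as well)."""
    return hashlib.sha256(text.strip().lower().encode()).hexdigest()

def filter_corpus(docs, score_fn, threshold=0.5):
    """Keep unique, clean documents the relevance classifier accepts."""
    seen, kept = set(), []
    for doc in docs:
        key = dedup_key(doc)
        if key in seen:
            continue                          # deduplication
        seen.add(key)
        if PROFANITY & set(doc.lower().split()):
            continue                          # profanity filtering
        if score_fn(doc) >= threshold:        # scalable relevance scoring
            kept.append(doc)
    return kept

# Toy scorer standing in for the distilled classifier.
toy_score = lambda d: 1.0 if "product" in d.lower() else 0.0

docs = [
    "This product has great battery life.",
    "This product has great battery life.",   # exact duplicate, dropped
    "Unrelated news article about weather.",  # scored irrelevant, dropped
]
print(filter_corpus(docs, toy_score))
```

The key design point mirrored here is that the expensive LLM labeling happens once, to train a cheap classifier (`score_fn`), which then scores trillions of tokens at scale.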

Building on the architectural advances of ModernBERT, RexBERT incorporates several modernizations: bias‑less linear layers, pre‑layer normalization, rotary positional embeddings (RoPE) with NTK‑aware scaling, GeGLU activation functions, and an alternating pattern of global and local attention to balance long‑range context with computational efficiency. The tokenizer is a 50,368‑token BPE derived from OLMo, offering better token efficiency than BERT’s original WordPiece tokenizer.
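The GeGLU feed-forward block can be sketched as follows. This is a minimal NumPy illustration of the gated-activation idea (bias-less, as in ModernBERT-style blocks); the weights are random placeholders, not trained parameters:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu_ffn(x, w_in, w_out):
    """x: (seq, d_model); w_in: (d_model, 2*d_ff); w_out: (d_ff, d_model).
    The input projection is split into a gate half and a value half."""
    u, v = np.split(x @ w_in, 2, axis=-1)
    return (gelu(u) * v) @ w_out  # gated activation, no bias terms

rng = np.random.default_rng(0)
d_model, d_ff, seq = 8, 16, 4
x = rng.normal(size=(seq, d_model))
y = geglu_ffn(x,
              rng.normal(size=(d_model, 2 * d_ff)),
              rng.normal(size=(d_ff, d_model)))
print(y.shape)  # (4, 8)
```

Compared with a plain GELU MLP, the gate lets the network modulate each hidden feature multiplicatively, which has empirically improved transformer quality at similar parameter counts.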

Training proceeds in three carefully designed phases:

  1. General Pre‑training – 1.7 trillion tokens from a diverse mix (web, books, code, technical papers) are used with a short 512‑token sequence length and a high 30 % masking ratio. This phase establishes robust general language representations.

  2. Context Extension – The maximum sequence length is increased to 8,192 tokens, and the model is trained for an additional 250 billion tokens. RoPE and NTK scaling allow the model to extrapolate positional information, while alternating global/local attention enables efficient processing of long product pages, FAQs, and concatenated attribute blocks.

  3. Annealed Domain Specialization – Finally, the model is fine‑tuned on the Ecom‑niverse corpus for roughly 350 billion tokens. Here the authors introduce Guided MLM, a targeted masking strategy that preferentially masks spans identified as product names, attribute values, or other commerce‑specific entities. Approximately 5 % of batches use this guided masking, while the remaining 95 % retain standard random span masking. The masking ratio is reduced to 10‑15 % to avoid over‑fitting, and sampling weights are gradually annealed toward the domain corpus, preserving general knowledge while emphasizing e‑commerce semantics.
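The Guided MLM batch-mixing logic described in phase 3 can be sketched as below. The function names and the token-level (rather than subword-level) masking are our own simplifications, not the paper's implementation; the 5 % guided fraction and reduced masking ratio follow the text above:

```python
import random

MASK = "[MASK]"

def random_mask(tokens, ratio=0.15, rng=random):
    """Standard MLM: mask a random subset at the given ratio."""
    out = list(tokens)
    n = max(1, int(len(tokens) * ratio))
    for i in rng.sample(range(len(tokens)), n):
        out[i] = MASK
    return out

def guided_mask(tokens, entity_spans, rng=random):
    """Guided MLM: preferentially mask spans pre-tagged as product
    names, attribute values, or other commerce entities."""
    out = list(tokens)
    for start, end in entity_spans:
        for i in range(start, end):
            out[i] = MASK
    return out

def mask_batch(tokens, entity_spans, guided_prob=0.05, rng=random):
    """~5% of batches use guided masking; the rest stay random."""
    if entity_spans and rng.random() < guided_prob:
        return guided_mask(tokens, entity_spans, rng)
    return random_mask(tokens, rng=rng)

tokens = "the acme x200 blender has a 1200 watt motor".split()
spans = [(1, 3)]  # "acme x200" tagged as a product-name span
print(guided_mask(tokens, spans))
```

Forcing the model to reconstruct exactly the commerce-bearing spans concentrates the training signal on domain entities without abandoning the standard MLM objective.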

Optimization uses StableAdamW (a variant of AdamW with Adafactor‑style update clipping) and a trapezoidal learning‑rate schedule dubbed Warmup‑Stable‑Decay (WSD), which holds a constant learning rate for most of training before decaying to zero with a 1 − sqrt schedule. Batch sizes are progressively increased during warm‑up to maximize hardware utilization.
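A minimal sketch of this trapezoidal WSD schedule, with illustrative phase lengths (the actual warmup/stable/decay boundaries depend on the training budget):

```python
import math

def wsd_lr(step, peak_lr, warmup, stable, decay):
    """Warmup-Stable-Decay: linear warmup, constant plateau,
    then a 1 - sqrt decay to zero."""
    if step < warmup:                         # linear warmup
        return peak_lr * step / warmup
    if step < warmup + stable:                # constant plateau
        return peak_lr
    t = (step - warmup - stable) / decay      # fraction of decay phase
    return peak_lr * (1.0 - math.sqrt(min(t, 1.0)))

peak = 1e-3
print(wsd_lr(50, peak, warmup=100, stable=800, decay=100))    # mid-warmup
print(wsd_lr(500, peak, warmup=100, stable=800, decay=100))   # plateau
print(wsd_lr(1000, peak, warmup=100, stable=800, decay=100))  # fully decayed
```

One practical appeal of WSD is that the long constant plateau lets checkpoints be branched into different decay runs (e.g., per specialization phase) without restarting training from scratch.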

Four model sizes are released: Micro (17 M parameters), Mini (68 M), Base (150 M), and Large (400 M). Architectural details (layers, hidden size, intermediate size, attention heads) are provided in Table 2. Despite having 2‑3× fewer parameters than many publicly available encoders, RexBERT consistently outperforms larger general‑purpose models on a suite of downstream tasks:

  • Token Classification – Using Amazon ESCI‑derived product attribute tagging, RexBERT achieves higher accuracy than BERT‑large (340 M) and RoBERTa‑base (125 M).
  • Semantic Similarity – Spearman correlation on e‑commerce similarity benchmarks surpasses that of long‑context models such as Longformer, even when those models operate with comparable or larger parameter counts.
  • General NLU (GLUE) – Performance remains competitive, demonstrating that domain specialization does not catastrophically forget general language abilities.

Ablation studies reveal that removing Guided MLM reduces performance by roughly 1.5 percentage points on domain‑specific metrics, confirming the value of targeted masking. The authors also compare against prior domain‑specific encoders (e.g., E‑BERT, CatBERT) and show that RexBERT’s open‑data pipeline yields superior results while remaining fully reproducible.

The paper discusses limitations: reliance on LLM‑generated relevance labels introduces potential labeling noise; the current focus is English‑only, leaving multilingual e‑commerce contexts for future work; and the 8K‑token context demands substantial GPU memory, which may affect deployment cost. Nonetheless, the modular pipeline—domain selection, LLM labeling, fastText distillation, multi‑phase curriculum—offers a blueprint that can be adapted to other specialized fields such as biomedical, legal, or scientific text.

In conclusion, RexBERT demonstrates that a carefully curated, large‑scale domain corpus combined with a principled, multi‑stage training regimen can produce compact, high‑performing encoders that outperform much larger generic models on domain‑specific tasks. The work underscores the importance of data quality and curriculum design over sheer model size, and it provides the community with both the datasets and reproducible training scripts to extend the approach to any vertical where language understanding is critical.

