Luxical: High-Speed Lexical-Dense Text Embeddings
Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today’s dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed “lexical-dense” text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF–IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over neural baselines of varying sizes, with throughput comparable to FastText inference in the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at https://github.com/datologyai/luxical.
💡 Research Summary
The paper introduces Luxical, a novel “lexical‑dense” embedding framework designed to bridge the gap between ultra‑fast lexical classifiers (e.g., FastText) and computationally heavy transformer‑based text encoders. Luxical converts a document into a sparse TF‑IDF vector over a fixed vocabulary of about two million 5‑grams, then passes this vector through a small ReLU‑based multilayer perceptron (MLP) to obtain a dense, ℓ2‑normalized embedding of 192 dimensions. The first linear layer of the MLP is implemented as a sparse‑by‑dense multiplication, gathering only the columns corresponding to non‑zero TF‑IDF entries; this operation is accelerated with Numba‑optimized kernels and a Rust‑based tokenizer, making the entire pipeline CPU‑friendly.
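The forward pass described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the library's actual implementation: the weight shapes, a single hidden layer, and the `(vocab, hidden)` layout of `W1` (so the gather is over rows here, equivalent to gathering columns in the usual `(out, in)` layout) are all assumptions; the real Luxical kernels are Numba-compiled and fed by a Rust tokenizer.

```python
import numpy as np

def lexical_dense_embed(tfidf_indices, tfidf_values, W1, b1, W2, b2):
    """Sketch of a lexical-dense forward pass (assumed architecture).

    tfidf_indices / tfidf_values hold the non-zero entries of the sparse
    TF-IDF vector over the n-gram vocabulary. Because the input is sparse,
    the first linear layer reduces to a weighted sum of a few gathered
    rows of W1 instead of a full (vocab_size x hidden) dense matmul.
    """
    h = tfidf_values @ W1[tfidf_indices] + b1   # sparse-by-dense first layer
    h = np.maximum(h, 0.0)                      # ReLU
    z = h @ W2 + b2                             # dense projection to 192 dims
    return z / np.linalg.norm(z)                # L2-normalized embedding

# Toy usage: 1k-entry vocabulary (the real model uses ~2M 5-grams).
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((1000, 64)), np.zeros(64)
W2, b2 = rng.standard_normal((64, 192)), np.zeros(192)
z = lexical_dense_embed(np.array([3, 17, 999]),
                        np.array([0.5, 1.2, 0.3]), W1, b1, W2, b2)
```

The key cost property is visible in the gather: per document, the first layer touches only as many rows of `W1` as there are distinct n-grams in the text, which is what keeps the pipeline CPU-friendly.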
Training leverages knowledge distillation from a strong teacher model (snowflake‑arctic‑embed‑m‑v2.0). For each batch, student embeddings S and teacher embeddings T are normalized, and their Gram matrices G_s = S Sᵀ and G_t = T Tᵀ are computed. After removing the diagonal and applying temperature scaling (τ = 3.0), a KL‑divergence loss aligns the student’s pairwise similarity structure with that of the teacher. This Gram‑matrix distillation encourages the student to reproduce the relational geometry of the teacher without requiring explicit pairwise labels, enabling efficient training on 50 M FineWeb documents over three epochs with a batch size of 3072 on CPUs.
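A minimal NumPy sketch of this Gram-matrix distillation objective follows. The summary pins down only the ingredients (row-normalized embeddings, Gram matrices, diagonal removal, τ = 3.0, a KL divergence), so the remaining conventions here are assumptions: similarities are divided by τ (one common choice; multiplying is also seen in the literature), rows are turned into distributions with a softmax, and the row-wise KL terms are averaged.

```python
import numpy as np

def _log_softmax(x):
    # Numerically stable row-wise log-softmax.
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def gram_kl_loss(S, T, tau=3.0):
    """Sketch of Gram-matrix distillation (assumed conventions).

    S: student embeddings (batch, d_s); T: teacher embeddings (batch, d_t).
    The embedding dimensions may differ; only the (batch x batch) pairwise
    similarity structure is compared.
    """
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    n = S.shape[0]
    off = ~np.eye(n, dtype=bool)  # drop trivial self-similarities
    Gs = (S @ S.T)[off].reshape(n, n - 1) / tau  # student logits per anchor
    Gt = (T @ T.T)[off].reshape(n, n - 1) / tau  # teacher logits per anchor
    log_q, log_p = _log_softmax(Gs), _log_softmax(Gt)
    # KL(teacher || student), averaged over anchors in the batch.
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=1).mean())
```

Because the loss depends only on within-batch similarity structure, no pairwise labels are needed: any batch of documents with cached teacher embeddings is a valid training signal.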
Empirical evaluation focuses on two realistic web‑scale scenarios. First, an end‑to‑end throughput benchmark embeds 100 k FineWeb documents on an Apple M4 Max CPU and an NVIDIA A10G GPU. Luxical‑One achieves 6,803 documents per second (23 MiB/s), outpacing a GPU‑accelerated MiniLM‑L6‑v2 by 34× and a Qwen‑3‑0.6B model by 97×, demonstrating that the sparse‑by‑dense design translates into practical speed gains. Second, a document‑half matching task splits 50 k documents into halves, forming 100 k query‑gallery pairs. Luxical‑One’s top‑1 error is higher than that of large transformer models, but its error curve converges rapidly as the retrieval window widens; at modest recall levels (e.g., the top 1 % of candidates) it surpasses MiniLM and approaches the performance of much larger models while maintaining orders‑of‑magnitude higher throughput.
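The document-half matching metric can be sketched as follows. Details beyond the summary (cosine similarity as the scoring function, tie handling via ranks) are assumptions; the idea is simply that each document half must retrieve its counterpart from the gallery within a top-k window.

```python
import numpy as np

def half_match_topk_error(queries, gallery, k):
    """Sketch of document-half matching evaluation (assumed details).

    queries[i] and gallery[i] embed the two halves of document i. A query
    "hits" at window k if its true counterpart ranks among its k nearest
    gallery neighbors by cosine similarity; returned value is the miss rate.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T                                   # (n_q, n_g) cosines
    true_sim = np.diag(sims)                         # score of true match
    # Rank of the true match: how many gallery items score at least as high.
    rank = (sims >= true_sim[:, None]).sum(axis=1)   # 1 = best possible
    return float((rank > k).mean())

# Usage: noisy copies of the queries should be recovered at small k.
rng = np.random.default_rng(1)
Q = rng.standard_normal((50, 8))
G = Q + 0.1 * rng.standard_normal((50, 8))
errs = [half_match_topk_error(Q, G, k) for k in (1, 5, 50)]
```

Widening `k` can only lower the error, which is why the summary emphasizes that Luxical‑One closes the gap to larger models once the retrieval window grows beyond strict top‑1.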
The analysis highlights several strengths: (1) a simple, modular architecture that retains the flexibility of dense embeddings while exploiting the efficiency of bag‑of‑ngrams; (2) a training objective that captures relational knowledge from powerful teachers without heavy computational overhead; (3) CPU‑centric optimizations (Rust tokenizer, Numba kernels) that make the system suitable for large‑scale preprocessing pipelines where GPU resources are scarce or cost‑prohibitive. Limitations include reliance on a fixed n‑gram vocabulary (which may hinder rapid domain adaptation) and lower absolute precision in strict top‑1 retrieval compared to state‑of‑the‑art transformers. Nonetheless, the authors argue that these trade‑offs are acceptable for many web‑scale organization tasks where coarse‑grained similarity (e.g., mining the top few percent of nearest neighbors) is sufficient.
In conclusion, Luxical offers a compelling point on the speed‑quality frontier for text embedding: it delivers dense, versatile representations at transformer‑level quality for many practical workloads while operating at or near FastText speeds on CPUs. The open‑source release (GitHub and Hugging Face) enables immediate adoption in both research and production environments, potentially reshaping how massive web corpora are filtered, clustered, and prepared for downstream large language model training.