Efficient Learning of Sparse Representations from Interactions

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

Behavioral patterns captured in embeddings learned from interaction data are pivotal across various stages of production recommender systems. However, in the initial retrieval stage, practitioners face an inherent tradeoff between embedding expressiveness and the scalability and latency of serving components, resulting in the need for representations that are both compact and expressive. To address this challenge, we propose a training strategy for learning high-dimensional sparse embedding layers in place of conventional dense ones, balancing efficiency, representational expressiveness, and interpretability. To demonstrate our approach, we modified the production-grade collaborative filtering autoencoder ELSA, achieving up to a 10x reduction in embedding size with no loss of recommendation accuracy, and up to a 100x reduction with only a 2.5% loss. Moreover, the active embedding dimensions reveal an interpretable inverted-index structure that segments items in a way directly aligned with the model's latent space, thereby enabling integration of segment-level recommendation functionality (e.g., 2D homepage layouts) within the candidate retrieval model itself. Source code, additional results, and a live demo are available at https://github.com/zombak79/compressed_elsa


💡 Research Summary

The paper tackles a fundamental scalability problem in large‑scale recommender systems: the memory and latency cost of dense item embeddings used in the candidate retrieval stage. While dense embeddings provide high expressive power, production constraints force engineers to compress them, often at the expense of accuracy. The authors propose a training‑time sparsification method that replaces the dense embedding matrix of the well‑known linear auto‑encoder model ELSA with a high‑dimensional sparse matrix, thereby achieving dramatic reductions in storage while preserving recommendation quality.

The core technical contribution is the introduction of a row‑wise top‑k sparsification operator S_k applied during training. For each item row in the embedding matrix A, only the k largest‑in‑absolute‑value entries are kept; the rest are zeroed out and the row is ℓ2‑normalized, yielding a sparse matrix \bar A_s. The loss function remains the same auto‑encoder reconstruction loss L(·) but now operates on \bar A_s \bar A_s^T − I. Directly fixing k from the start leads to “dead latent” dimensions, so the authors design gradual pruning schedules: k is decreased over training steps according to constant, linear, exponential, or step‑wise decay. They also compare two post‑pruning strategies—re‑initializing the remaining parameters versus continuing training from the pruned state. Empirically, decreasing k after each epoch outperforms waiting for full convergence, and re‑initialization yields a modest additional gain.
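The row-wise top-k operator and a decaying-k schedule can be sketched as follows. This is a minimal NumPy illustration of the mechanism described above, not the authors' implementation; the function names and the linear schedule parameters are ours.

```python
import numpy as np

def topk_sparsify(A, k):
    """Row-wise top-k operator S_k: keep the k largest-magnitude entries
    of each row of A, zero the rest, then L2-normalize each row."""
    out = np.zeros_like(A)
    for i, row in enumerate(A):
        idx = np.argpartition(np.abs(row), -k)[-k:]  # indices of k largest |a_ij|
        out[i, idx] = row[idx]
        norm = np.linalg.norm(out[i])
        if norm > 0:
            out[i] /= norm  # row lives on the unit sphere, as in ELSA
    return out

def linear_k_schedule(step, total_steps, k_start, k_final):
    """One of the gradual pruning schedules mentioned above: decay k linearly
    from k_start to k_final over training (hypothetical parameterization)."""
    frac = min(step / total_steps, 1.0)
    return int(round(k_start + frac * (k_final - k_start)))

# Example: 4 item rows, 8 latent dimensions, keep k = 2 entries per row.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 8))
A_s = topk_sparsify(A, k=2)
assert (np.count_nonzero(A_s, axis=1) == 2).all()
assert np.allclose(np.linalg.norm(A_s, axis=1), 1.0)
```

In training, `topk_sparsify` would be applied to the embedding matrix before computing the reconstruction loss on \bar A_s \bar A_s^T − I, with `k` lowered after each epoch rather than after full convergence, per the paper's findings.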

Storage cost for a sparse row with k non‑zeros is (4 bytes value + 4 bytes index) × k, i.e., 8 bytes per active dimension. Consequently, a vector with k = 128 occupies only 1 KB, allowing the model to keep many more latent dimensions than a dense 256‑dimensional vector (also 1 KB). Inference uses both \bar A_s and its transpose stored in CSC format, enabling fast sparse matrix‑vector multiplication with O(n k) complexity, which reduces both memory bandwidth and latency.
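The storage arithmetic and the sparse inference path can be checked with a short SciPy sketch. The sparsity pattern below is random and purely illustrative; only the byte accounting and the CSC matrix-vector products mirror the description above.

```python
import numpy as np
from scipy.sparse import csc_matrix

# Byte accounting: 4-byte float value + 4-byte int index per active dimension.
k = 128
bytes_per_item = 8 * k
assert bytes_per_item == 1024  # k = 128 sparse row fits in 1 KB,
                               # same budget as a dense 256-dim float32 vector

# Inference as sparse matrix-vector products, with the item matrix in CSC format.
rng = np.random.default_rng(1)
n_items, d = 1000, 4096
dense = rng.normal(size=(n_items, d)).astype(np.float32)
mask = np.zeros_like(dense, dtype=bool)
for i in range(n_items):  # keep k random entries per row (illustrative pattern)
    mask[i, rng.choice(d, size=k, replace=False)] = True
A_s = csc_matrix(np.where(mask, dense, 0.0))

x = rng.normal(size=n_items).astype(np.float32)  # user interaction vector
user_latent = A_s.T @ x    # encode: O(n * k) multiply-adds
scores = A_s @ user_latent # decode: retrieval scores for all items
assert scores.shape == (n_items,)
```

Storing both \bar A_s and its transpose in CSC form, as the paper does, makes both the encode and decode products column-major and cache-friendly.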

Beyond compression, the paper shows that the learned sparse codes have an interpretable structure. For each item, the dominant latent dimension (largest absolute entry) and its sign define a “latent factor”. Items sharing the same (dimension, sign) are grouped, and their metadata (e.g., book titles, genres) are fed to a sentence‑embedding model to generate short textual descriptors. Groups with semantically similar descriptors are merged using a cosine‑similarity threshold, producing a set of coherent “segments”. A segment matrix B_s is built where each row contains only the signed dimensions that define the segment; after ℓ2‑normalization, the segment matrix lives in the same space as the item embeddings. Recommendation scores for segments are computed as x^T \bar A_s \bar B_s^T, allowing a single user vector to rank both individual items and higher‑level segments simultaneously.
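The segment construction and joint scoring reduce to a small amount of linear algebra. The toy example below uses our own data and helper names; only the (dimension, sign) grouping, the ℓ2-normalized segment matrix B_s, and the x^T \bar A_s \bar B_s^T score come from the paper.

```python
import numpy as np

def dominant_factor(row):
    """(dimension, sign) of the largest-magnitude entry: the item's latent factor."""
    j = int(np.argmax(np.abs(row)))
    return j, int(np.sign(row[j]))

def segment_matrix(segments, d):
    """Build B_s: each row activates only the signed dimensions defining one
    segment, then is L2-normalized into the item-embedding space.
    `segments` is a list of lists of (dimension, sign) pairs (our notation)."""
    B = np.zeros((len(segments), d))
    for s, dims in enumerate(segments):
        for j, sign in dims:
            B[s, j] = sign
        B[s] /= np.linalg.norm(B[s])
    return B

# Toy example: 3 items in a 6-dim sparse space, two merged segments.
A_s = np.array([
    [0.9, 0.0, -0.4, 0.0, 0.0, 0.0],
    [0.8, 0.0,  0.0, 0.0, 0.6, 0.0],
    [0.0, 0.0,  0.0, -1.0, 0.0, 0.0],
])
assert dominant_factor(A_s[2]) == (3, -1)

segments = [[(0, +1)], [(3, -1)]]
B_s = segment_matrix(segments, d=6)

x = np.array([1.0, 1.0, 0.0])     # user interacted with items 0 and 1
item_scores = x @ A_s @ A_s.T     # item-level recommendation scores
segment_scores = x @ A_s @ B_s.T  # segment-level scores from the same user vector
assert segment_scores[0] > segment_scores[1]
```

Because B_s is normalized into the same latent space as \bar A_s, one encoded user vector ranks items and segments in a single pass, which is what enables segment-level surfaces such as 2D homepage layouts inside the retrieval model.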

Experiments on three public datasets—Goodbooks‑10k, MovieLens‑20M, and Netflix Prize—measure nDCG@100 across a range of storage budgets. Compressed ELSA achieves near‑identical performance to the original dense ELSA (e.g., 0.491 vs. 0.489 on Goodbooks‑10k) while using 10× less memory (1 KB vs. 10 KB). Even at 100× compression (64 B), the loss is modest (0.469 vs. 0.489). The method outperforms several baselines: low‑dimensional ELSA, post‑hoc sparsification via a Sparse Auto‑Encoder (ELSA+SAE), and Pruned EASE, demonstrating a superior accuracy‑size trade‑off.

Qualitative analysis confirms that the latent dimensions correspond to meaningful item groups such as “Children’s Classics”, “Detective Fiction”, or “Thrilling Mysteries”. Visualizations show that a user’s sparse activation vector aligns with the segments recommended to them, providing a clear, interpretable path from latent activation through segment relevance to final item recommendations.

The authors acknowledge limitations: the approach benefits from starting with a relatively large latent dimension (e.g., d = 4096), which incurs higher training cost; diminishing returns appear beyond a certain size. Moreover, segment labeling currently relies on external metadata, so fully unsupervised segment semantics remain an open question. Future work may explore automated segment naming, adaptive pruning schedules, and application of the sparsification technique to other collaborative‑filtering architectures.

In summary, the paper presents a practical, training‑time sparsification framework that dramatically reduces embedding size, speeds up inference, and yields interpretable segment structures, all while maintaining state‑of‑the‑art recommendation accuracy.

