MALLOC: Benchmarking the Memory-aware Long Sequence Compression for Large Sequential Recommendation
The scaling law, which indicates that model performance improves with increasing dataset size and model capacity, has fueled a growing trend of expanding recommendation models in both industry and academia. However, the advent of large-scale recommenders also brings significantly higher computational costs, particularly under the long user behavior sequences inherent to recommendation systems. Current approaches often rely on pre-storing the intermediate states of each user's past behavior, thereby reducing the quadratic re-computation cost for subsequent requests. Despite their effectiveness, these methods often treat memory merely as a medium for acceleration, without adequately considering the space overhead it introduces. This presents a critical challenge in real-world recommendation systems with billions of users, each of whom might initiate thousands of interactions and thus require massive memory for state storage. Fortunately, several memory management strategies have been examined for compression in LLMs, though most have not been evaluated on recommendation tasks. To mitigate this gap, we introduce MALLOC, a comprehensive benchmark for memory-aware long sequence compression. MALLOC presents a comprehensive investigation and systematic classification of memory management techniques applicable to large sequential recommendation. These techniques are integrated into state-of-the-art recommenders, enabling a reproducible and accessible evaluation platform. Through extensive experiments across accuracy, efficiency, and complexity, we demonstrate the holistic reliability of MALLOC in advancing large-scale recommendation. Code is available at https://anonymous.4open.science/r/MALLOC.
💡 Research Summary
The paper introduces MALLOC, a comprehensive benchmark designed to evaluate memory‑aware long‑sequence compression techniques for large‑scale sequential recommendation systems. Motivated by the scaling law that larger models and datasets yield better performance, the authors highlight a critical “Memory–Latency Dilemma”: storing full key‑value (KV) caches for billions of users exceeds GPU memory limits, while recomputing KV states incurs prohibitive multiply‑accumulate (MAC) costs. Existing solutions either keep full caches (high memory) or recompute (high latency), making them unsuitable for real‑time services.
MALLOC addresses this dilemma by classifying compression methods according to the granularity of memory allocation: (1) Sequence‑level compression reduces the entire sequence length (e.g., Linformer, Reformer) via low‑dimensional projections; (2) Token‑level compression prunes or merges less important tokens, preserving core user actions while cutting token count; (3) Head‑level compression groups or shares attention heads (e.g., multi‑query attention (MQA), grouped‑query attention (GQA), multi‑head latent attention (MLA)) to eliminate redundant KV storage; and (4) Precision‑level compression quantizes KV tensors from FP32 to lower‑precision formats (FP16, INT8), shrinking storage and bandwidth requirements.
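To make the fourth family concrete, here is a minimal sketch of precision‑level compression: symmetric per‑tensor INT8 quantization of a KV cache tensor. This is an illustrative implementation under common quantization conventions, not code from the paper; the shapes and function names are assumptions.

```python
import numpy as np

def quantize_int8(kv: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map [-max|kv|, +max|kv|] to [-127, 127]."""
    scale = float(np.abs(kv).max()) / 127.0
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original tensor."""
    return q.astype(np.float32) * scale

# Hypothetical FP32 KV states for one user: (seq_len, head_dim).
rng = np.random.default_rng(0)
kv = rng.standard_normal((1000, 64)).astype(np.float32)

q, scale = quantize_int8(kv)
kv_hat = dequantize_int8(q, scale)

# Storage shrinks 4x (FP32 -> INT8); rounding error is bounded by scale / 2.
print(kv.nbytes / q.nbytes)  # 4.0
print(float(np.abs(kv - kv_hat).max()))
```

In practice, per‑channel or per‑block scales usually tighten the error bound further; the per‑tensor variant above is just the simplest case of the idea.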
The benchmark integrates these four families into state‑of‑the‑art recommender backbones such as MARM, MTGR, and HSTU, ensuring identical data preprocessing, model architecture, and training pipelines. Evaluation metrics extend beyond standard ranking scores (HR@K, NDCG@K) to include computational overhead (MACs) and memory footprint (GB).
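A back‑of‑the‑envelope estimate helps ground the memory‑footprint metric. The sketch below uses the standard KV‑cache sizing formula with hypothetical model dimensions (these numbers are assumptions for illustration, not figures from the paper):

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    # K and V each store seq_len x n_kv_heads x head_dim values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical transformer recommender: 4 layers, 8 KV heads, head_dim 64,
# a user history of 2,000 interactions, FP32 storage (4 bytes per element).
full = kv_cache_bytes(2000, 4, 8, 64, 4)  # uncompressed per-user cache
gqa  = kv_cache_bytes(2000, 4, 2, 64, 4)  # head-level: 4 query heads share each KV head
int8 = kv_cache_bytes(2000, 4, 8, 64, 1)  # precision-level: INT8 cache

print(full / 2**20, gqa / 2**20, int8 / 2**20)  # MiB per user
```

Multiplied across billions of users, even a few MiB per user dominates total serving cost, which is why the benchmark tracks footprint in GB alongside MACs.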
Experiments on multiple real‑world datasets (e‑commerce logs, video streaming histories) reveal distinct trade‑offs. Sequence‑level compression can cut memory usage by over 80 % with less than 2 % loss in accuracy. Token‑level and head‑level methods reduce MACs by 30‑50 % while keeping accuracy degradation under 1 %. Precision‑level quantization dramatically improves cache bandwidth, lowering inference latency by more than 40 % and increasing overall throughput by ~1.5×.
Beyond performance, the authors assess implementation complexity. Sequence‑level approaches require substantial architectural changes and hardware‑specific tuning, whereas token‑ and head‑level techniques can be added as lightweight plugins with modest code modifications. Quantization leverages existing framework APIs and thus offers the best engineering return.
MALLOC visualizes the accuracy‑resource trade‑off as a Pareto frontier, enabling practitioners to select the most appropriate compression strategy given specific memory or latency constraints. The paper concludes with practical deployment guidelines: prioritize sequence‑level compression when memory is the bottleneck, combine token‑ and head‑level methods when compute is limited, and apply quantization universally for the highest cost‑effectiveness.
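The Pareto‑frontier selection described above can be sketched in a few lines: a configuration survives only if no other configuration is both more accurate and cheaper in memory. The (accuracy, memory) points below are hypothetical, not results from the paper.

```python
def pareto_frontier(points):
    """Return configurations not dominated by any other point.
    A point dominates if it has accuracy >= AND memory <= (and differs)."""
    frontier = []
    for acc, mem in points:
        dominated = any(a >= acc and m <= mem and (a, m) != (acc, mem)
                        for a, m in points)
        if not dominated:
            frontier.append((acc, mem))
    return sorted(frontier)

# Hypothetical (NDCG@10, memory-GB) results for five compression configs.
configs = [(0.30, 8.0), (0.29, 2.0), (0.28, 1.5), (0.25, 1.6), (0.27, 3.0)]
print(pareto_frontier(configs))  # [(0.28, 1.5), (0.29, 2.0), (0.30, 8.0)]
```

Given a memory budget, a practitioner would simply pick the highest‑accuracy frontier point whose memory fits, which is exactly the selection workflow the benchmark's visualization supports.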
In summary, MALLOC provides a unified, reproducible platform for systematic study of memory‑aware compression in large sequential recommenders, bridging the gap between academic research and industrial deployment by offering both rigorous evaluation and actionable engineering insights.