ALPBench: A Benchmark for Attribution-level Long-term Personal Behavior Understanding

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Recent advances in large language models have highlighted their potential for personalized recommendation, where accurately capturing user preferences remains a key challenge. Leveraging their strong reasoning and generalization capabilities, LLMs offer new opportunities for modeling long-term user behavior. To systematically evaluate this, we introduce ALPBench, a Benchmark for Attribution-level Long-term Personal Behavior Understanding. Unlike item-focused benchmarks, ALPBench predicts the attribute combinations a user is interested in, enabling ground-truth evaluation even for newly introduced items. It models preferences from long-term historical behaviors rather than users' explicitly expressed requests, better reflecting enduring interests. User histories are represented as natural language sequences, allowing interpretable, reasoning-based personalization. ALPBench enables fine-grained evaluation of personalization by focusing on attribute-combination prediction, a task that remains highly challenging for current LLMs because it requires capturing complex interactions among multiple attributes and reasoning over long-term user behavior sequences.


💡 Research Summary

Title: ALPBench: A Benchmark for Attribution‑level Long‑term Personal Behavior Understanding

Overview
The paper introduces ALPBench, a novel benchmark designed to evaluate large language models (LLMs) on their ability to understand and personalize based on long‑term user behavior. Unlike traditional recommendation benchmarks that ask models to predict the next item a user will interact with, ALPBench reframes the task as predicting the attribute combination of the item a user is most likely to purchase next. This shift decouples stable user preferences from transient system dynamics such as item popularity, cold‑start effects, or platform‑wide exposure strategies, enabling a cleaner assessment of a model’s reasoning and personalization capabilities.

Data Construction
The dataset is built from real‑world Chinese e‑commerce logs collected on the Kuaishou platform. The authors follow a four‑stage pipeline:

  1. Category Selection – High‑frequency product categories (e.g., Pants, Shoes, Snacks, Baijiu, Badminton Racket, Cell Phones, Fishing Rods) are chosen to ensure reliable metadata.
  2. User Filtering & Denoising – Users are retained only if they have at least one confirmed purchase during a “shopping festival” period, which the authors argue reflects deliberate, stable preferences. Low‑intent or noisy interactions are removed.
  3. Context Cleaning – For each interaction, three textual fields are extracted: product title, curated selling points, and price tier (low/medium/high). Non‑informative tokens (URLs, duplicates) are filtered, and an LLM (Gemini‑2.5‑Pro) is used to normalize attribute values and remove residual noise.
  4. Human‑Model Review – Because many samples contain very long sequences (up to several thousand tokens), a hybrid review process is employed: the model first generates a candidate answer with an explicit reasoning trace, and human annotators verify the logical coherence and factual support. Final judgments are made by humans.
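Stages 2 and 3 of the pipeline can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the record fields, the festival dates, and the URL filter are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical minimal interaction record; field names are illustrative.
@dataclass
class Interaction:
    user_id: str
    ts: date
    is_purchase: bool
    title: str
    selling_points: str
    price_tier: str  # "low" | "medium" | "high"

# Assumed shopping-festival window (the paper does not specify dates).
FESTIVAL_START, FESTIVAL_END = date(2023, 6, 1), date(2023, 6, 18)

def keep_user(history: list[Interaction]) -> bool:
    """Stage 2: retain a user only if they have at least one confirmed
    purchase inside the shopping-festival window."""
    return any(
        it.is_purchase and FESTIVAL_START <= it.ts <= FESTIVAL_END
        for it in history
    )

def clean_context(it: Interaction) -> dict:
    """Stage 3 (rule-based part only): keep the three textual fields and
    drop non-informative tokens such as URLs. LLM-based normalization
    would follow this step."""
    def strip_urls(s: str) -> str:
        return " ".join(t for t in s.split() if not t.startswith("http"))
    return {
        "title": strip_urls(it.title),
        "selling_points": strip_urls(it.selling_points),
        "price_tier": it.price_tier,
    }
```

In the real pipeline the LLM-based normalization and the human-model review would run after this rule-based filtering.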

The resulting benchmark contains three temporal horizons—3‑month, 6‑month, and 12‑month histories—each ranging from roughly 0.9k to 2.2k tokens per user. For each user, the task is to output a preference profile y = {v₁,…,v_k}, where each v_j belongs to a predefined candidate set V_j for attribute A_j of the target category C. The prediction must be a joint selection from the combinatorial space V = V₁ × … × V_k, not independent per‑attribute classifications.
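The joint-selection requirement can be made concrete with a small sketch. The attribute names and candidate values below are invented for illustration; only the structure (a prediction drawn from V = V₁ × … × V_k, scored jointly rather than per attribute) follows the benchmark's definition.

```python
from itertools import product

# Illustrative candidate sets for a hypothetical "Pants" category
# (attribute names and values are assumptions, not taken from ALPBench).
candidates = {
    "fit": ["slim", "regular", "loose"],
    "material": ["denim", "cotton", "linen"],
    "price_tier": ["low", "medium", "high"],
}

# The joint label space V = V1 x ... x Vk
V = list(product(*candidates.values()))
assert len(V) == 27  # 3 * 3 * 3 combinations

def exact_match(pred: dict, gold: dict) -> bool:
    """A prediction counts only if every attribute value matches jointly,
    not as k independent per-attribute classifications."""
    return all(pred[a] == gold[a] for a in candidates)
```

Getting two of three attributes right therefore scores the same as getting none right under a joint exact-match criterion, which is what makes the combinatorial space hard.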

Task Formalization
Given a user’s long‑range behavior sequence S_u, the category C, the attribute set A_C, and candidate value sets {V_j}, the model must compute:

ŷ = f(S_u, C, A_C, {V_j}), where ŷ = {v̂₁, …, v̂_k} ∈ V = V₁ × … × V_k.
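Since user histories are represented as natural language, one straightforward way to realize f is to serialize S_u, C, and the candidate sets into a prompt and validate the model's answer against V. The sketch below assumes this prompting setup; the instruction wording and JSON answer format are illustrative assumptions, not the benchmark's actual template.

```python
import json

def build_prompt(history_text: str, category: str,
                 candidates: dict[str, list[str]]) -> str:
    """Serialize S_u, C, and the candidate sets {V_j} into one prompt.
    The instruction wording here is an assumption."""
    attr_lines = "\n".join(
        f"- {attr}: {', '.join(values)}" for attr, values in candidates.items()
    )
    return (
        f"User purchase history (long-term):\n{history_text}\n\n"
        f"Target category: {category}\n"
        f"For each attribute, pick exactly one value from its candidate set:\n"
        f"{attr_lines}\n\n"
        f'Answer as JSON, e.g. {{"attribute": "value", ...}}.'
    )

def parse_profile(raw: str, candidates: dict[str, list[str]]) -> dict:
    """Validate that the model's answer is a joint selection from V."""
    profile = json.loads(raw)
    assert set(profile) == set(candidates), "must cover every attribute"
    for attr, value in profile.items():
        assert value in candidates[attr], f"{value!r} not in candidates for {attr}"
    return profile
```

The validation step matters: a free-form LLM answer only counts as a prediction ŷ if every value lies in its candidate set V_j, so out-of-vocabulary answers are rejected rather than silently scored.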
