Effectiveness of LLMs in Temporal User Profiling for Recommendation
Research Summary
The paper investigates whether large language models (LLMs) can be used to create temporally aware user profiles that improve recommendation accuracy while also providing intrinsic interpretability. Traditional profiling methods, such as averaging item embeddings or using sequential neural models, typically conflate short-term, fleeting interests with long-term, stable preferences, which hampers both predictive performance and explainability. To address this, the authors propose a three-stage pipeline. First, for each user they query an LLM twice with carefully crafted prompts: one that asks the model to summarize recent interactions (short-term profile) and another that asks for a summary of enduring behaviors across the entire history (long-term profile). The resulting natural-language texts (NL_short and NL_long) are then encoded with a pretrained BERT variant (MiniLM-L6-v2) to obtain dense vectors r_short and r_long. Second, a learnable attention layer computes scalar weights α_short and α_long from these vectors, allowing the model to dynamically balance the influence of recent versus historical interests. The final user embedding e_u = α_short·r_short + α_long·r_long is concatenated with an item embedding e_i and fed into a multilayer perceptron (MLP) that predicts interaction likelihood via binary cross-entropy loss.
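The fusion and scoring steps above can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the encoder stub, the single attention vector `w_att`, and the MLP weight shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 384  # MiniLM-L6-v2 sentence embeddings are 384-dimensional


def encode_profile(text: str) -> np.ndarray:
    """Stand-in for the MiniLM-L6-v2 encoder (hypothetical stub;
    a real system would call the sentence-transformer model)."""
    r = np.random.default_rng(abs(hash(text)) % 2**32).standard_normal(DIM)
    return r / np.linalg.norm(r)


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()


def fuse_profiles(r_short, r_long, w_att):
    """Attention over the two profile vectors: one scalar score per
    vector, softmax-normalized into alpha_short and alpha_long."""
    scores = np.array([r_short @ w_att, r_long @ w_att])
    alpha = softmax(scores)
    e_u = alpha[0] * r_short + alpha[1] * r_long  # fused user embedding
    return e_u, alpha


def mlp_score(e_u, e_i, W1, b1, w2, b2):
    """Score the concatenation [e_u; e_i] with a one-hidden-layer MLP."""
    h = np.maximum(0.0, np.concatenate([e_u, e_i]) @ W1 + b1)  # ReLU
    logit = h @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> interaction probability


# Toy forward pass with randomly initialized (untrained) weights
r_s = encode_profile("recently binged sci-fi thrillers")
r_l = encode_profile("long-standing fan of classic dramas")
w_att = rng.standard_normal(DIM)
e_u, alpha = fuse_profiles(r_s, r_l, w_att)

e_i = rng.standard_normal(DIM)            # item embedding
W1, b1 = rng.standard_normal((2 * DIM, 64)) * 0.05, np.zeros(64)
w2, b2 = rng.standard_normal(64) * 0.05, 0.0
p = mlp_score(e_u, e_i, W1, b1, w2, b2)
```

At training time, `w_att` and the MLP weights would be learned jointly under the binary cross-entropy loss; with the random weights above, the probability is uninformative, but the shapes and data flow mirror the pipeline described.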
Experiments are conducted on two Amazon domains: Movies&TV (high activity, average 11.79 interactions per user) and Video Games (low activity, average 4.55 interactions). Baselines include a naïve "Centric" average-embedding method, popularity ranking, matrix factorization (MF), and a temporal-fusion model (Temp-Fusion) that aggregates numerical short- and long-term embeddings without any LLM-generated text. Evaluation follows a strict per-user temporal hold-out protocol, reporting Recall@K and NDCG@K.
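For concreteness, the two reported metrics can be computed per user as below, assuming binary relevance (each held-out item counts equally), which is the standard convention for implicit-feedback hold-out evaluation:

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of a user's held-out items appearing in the top-k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg else 0.0


# Example: two held-out items, ranked first and third by the model
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
r = recall_at_k(ranked, relevant, 3)  # 2 of 2 relevant items in top-3
n = ndcg_at_k(ranked, relevant, 3)    # (1 + 0.5) / (1 + 1/log2(3))
```

The per-user scores are then averaged over all test users; the temporal hold-out simply means each user's ranked list is evaluated against their chronologically last interactions.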
Results show that the LLM-driven temporal profiling (LLM-TP) yields the most substantial gains in the high-activity Movies&TV domain: Recall@10 improves by 17% and NDCG@10 by 14% over the Centric baseline, and it also outperforms Temp-Fusion. In the sparser Video Games domain, improvements are modest; LLM-TP achieves the highest Recall@20 but is otherwise comparable to or slightly below Temp-Fusion for smaller K values. The authors attribute this disparity to the fact that, in low-activity settings, user preferences tend to be more stable, reducing the benefit of finely separating short- and long-term signals.
Beyond accuracy, the approach offers built-in explainability. The natural-language profiles are human-readable, and the learned attention weights directly indicate whether a recommendation is driven more by recent or historical interests. Figure 2 in the paper illustrates how these components could be combined to generate user-facing explanations, although a full user interface and user study are left for future work.
The paper also conducts an ablation study on the Movies&TV dataset. Removing either the short-term or long-term textual component degrades performance (a roughly 15% loss in Recall@20), confirming that both temporal signals are complementary. Replacing the MLP scorer with a simple dot product causes the steepest drop, highlighting the importance of non-linear interaction modeling.
Limitations include the computational cost of invoking LLMs for every user, the sensitivity of results to prompt design, and potential domain-specific variations in summarization quality. The authors suggest future directions such as employing distilled or quantized LLMs, automated prompt optimization, and systematic user studies to assess explanation satisfaction.
In summary, the study demonstrates that LLM-generated, temporally disentangled user profiles can substantially boost recommendation performance in domains with rich, dynamic interaction histories while simultaneously providing a transparent mechanism for explaining recommendations. The benefits, however, are context-dependent, and practical deployment will require careful consideration of computational overhead and domain characteristics.