BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models


Recent years have witnessed a rapid surge in research leveraging Large Language Models (LLMs) for recommendation. These methods typically employ supervised fine-tuning (SFT) to adapt LLMs to recommendation scenarios, and utilize beam search during inference to efficiently retrieve $B$ top-ranked recommended items. However, we identify a critical training-inference inconsistency: while SFT optimizes the overall probability of positive items, it does not guarantee that such items will be retrieved by beam search even if they possess high overall probabilities. Due to the greedy pruning mechanism, beam search can prematurely discard a positive item once its prefix probability is insufficient. To address this inconsistency, we propose BEAR (Beam-SEarch-Aware Regularization), a novel fine-tuning objective that explicitly accounts for beam search behavior during training. Rather than directly simulating beam search for each instance during training, which is computationally prohibitive, BEAR enforces a relaxed necessary condition: each token in a positive item must rank within the top-$B$ candidate tokens at each decoding step. This objective effectively mitigates the risk of incorrect pruning while incurring negligible computational overhead compared to standard SFT. Extensive experiments across four real-world datasets demonstrate that BEAR significantly outperforms strong baselines. Code will be released upon acceptance.


💡 Research Summary

The paper addresses a fundamental mismatch between the training objective of large language model (LLM) based recommender systems and the inference procedure that typically employs beam search. In current practice, recommender systems convert a user’s interaction history into a textual prompt, then fine‑tune the LLM with supervised fine‑tuning (SFT) to maximize the overall probability P(y|x) of the target (positive) item y given the prompt x. At inference time, however, it is infeasible to enumerate all items, so beam search is used: at each decoding step only the top‑B candidate token sequences are kept, and the rest are pruned. The authors observe that a high overall probability does not guarantee that an item survives beam search because a low‑probability prefix can cause early pruning. Empirical analysis on four real‑world sequential recommendation datasets shows that more than 80 % of items that rank within the top‑B overall probability are nevertheless pruned by beam search.
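The pruning failure described above can be reproduced with a toy two-step autoregressive model. In the sketch below (token names and probabilities are invented purely for illustration, not taken from the paper), the sequence with the highest overall probability starts from a weaker first token, so a narrow beam discards it at step one:

```python
from itertools import product

# Toy autoregressive model: conditional next-token probabilities.
# (Tokens and numbers are invented for illustration only.)
step1 = {"a": 0.45, "b": 0.35, "c": 0.20}
step2 = {
    "a": {"x": 0.50, "y": 0.50},
    "b": {"x": 0.05, "y": 0.95},
    "c": {"x": 0.50, "y": 0.50},
}

def beam_search(B):
    # Keep only the B highest-probability prefixes after step 1 ...
    beams = sorted(step1.items(), key=lambda kv: kv[1], reverse=True)[:B]
    # ... then extend each survivor and keep the top-B full sequences.
    finished = [((t, u), p * q)
                for (t, p) in beams
                for (u, q) in step2[t].items()]
    return sorted(finished, key=lambda kv: kv[1], reverse=True)[:B]

# Exhaustive ranking over all six two-token sequences:
exact = sorted((((t, u), step1[t] * step2[t][u])
                for t, u in product(step1, ["x", "y"])),
               key=lambda kv: kv[1], reverse=True)

print(exact[0])        # ('b', 'y') has the highest overall probability, 0.3325
print(beam_search(1))  # yet with B = 1 the beam committed to prefix 'a' and pruned it
```

Here the item ("b", "y") attains overall probability 0.35 × 0.95 = 0.3325, higher than any continuation of "a" (at most 0.225), but a beam of width 1 keeps only the prefix "a" (0.45 > 0.35) and can never recover it. This is exactly the necessary-condition violation that BEAR targets.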

To remedy this, the authors propose BEAR (Beam‑Search‑Aware Regularization), a novel fine‑tuning objective that explicitly incorporates beam‑search dynamics without the prohibitive cost of simulating beam search during training. The key insight is a necessary condition for a positive item to be retained: every token of the item must rank within the top‑B candidate tokens at its decoding step. BEAR adds a regularization term that penalizes violations of this condition. Concretely, for each token y_t the model computes the conditional probability Pθ(y_t | y_{<t}, x) and determines its rank among all vocabulary tokens. If the rank falls outside the top‑B, a hinge‑style loss max(0, rank − B + 1) is incurred. The final loss is a weighted sum of the standard SFT cross‑entropy loss and the BEAR regularization loss. Because the rank can be obtained from the same forward pass used for SFT, BEAR incurs virtually no extra computational overhead.
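A minimal sketch of this combined objective is given below. It is an illustrative reconstruction from the summary, not the paper's reference implementation: ranks are taken as zero-indexed (so that max(0, rank − B + 1) vanishes whenever the target token sits inside the top-B), the weighting coefficient `lam` and all logit values are invented, and a real implementation would use a differentiable surrogate for the rank rather than the hard count used here:

```python
import math

def bear_loss(logits_per_step, target_ids, B, lam=1.0):
    """SFT cross-entropy plus a BEAR-style rank penalty (illustrative sketch).

    logits_per_step: one list of vocabulary logits per decoding step.
    target_ids: token ids of the positive item, one per step.
    Ranks are zero-indexed, so max(0, rank - B + 1) is zero whenever
    the target token is among the top-B candidates at that step.
    """
    ce, penalty = 0.0, 0.0
    for logits, t in zip(logits_per_step, target_ids):
        # Standard SFT term: negative log-likelihood of the target token.
        z = max(logits)  # stabilized log-sum-exp
        log_norm = z + math.log(sum(math.exp(l - z) for l in logits))
        ce += log_norm - logits[t]
        # BEAR term: how far the target token falls outside the top-B.
        rank = sum(1 for l in logits if l > logits[t])  # zero-indexed rank
        penalty += max(0, rank - B + 1)
    return ce + lam * penalty

# Two decoding steps over a five-token vocabulary (numbers invented):
logits = [[4.0, 1.0, 3.0, 0.5, 2.0],
          [0.0, 5.0, 1.0, 4.0, 3.0]]
targets = [2, 4]  # zero-indexed target ranks at the two steps: 1 and 2
print(bear_loss(logits, targets, B=2))  # step 2 violates top-2, adding penalty 1
print(bear_loss(logits, targets, B=3))  # both tokens inside top-3: pure SFT loss
```

Note that both terms come from the same per-step logits, which is why the summary reports essentially no overhead over plain SFT: the rank is a cheap comparison over a forward pass that SFT already performs.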

Experiments compare BEAR against nine recent LLM‑based recommendation baselines across four datasets (books, toys, movies, and a multi‑domain set). BEAR consistently outperforms all baselines, achieving an average relative improvement of 12.5 % in NDCG and Recall. The advantage is most pronounced with small beam widths (e.g., B = 5), where the pruning problem is severe. Ablation studies show that BEAR dramatically reduces the “necessary‑condition‑violation” pruning rate—from over 70 % in vanilla SFT to under 10 %—confirming that the regularizer directly addresses the identified failure mode. Moreover, BEAR is model‑agnostic: it yields similar gains when applied to GPT‑2, LLaMA‑7B, and other LLM backbones, and its training time remains comparable to standard SFT (≈ 1×), whereas a naïve approach that simulates beam search during training would be several times slower.

The paper also discusses limitations and future work. BEAR’s effectiveness diminishes as the beam width grows, because the necessary condition becomes easier to satisfy. The token‑rank regularizer may be overly strict for rare tokens, suggesting the need for adaptive weighting or smoothing. Potential extensions include dynamic beam‑width adaptation, integration with multimodal prompts, and coupling with graph‑based user‑item interaction models.

In summary, BEAR provides a principled, low‑cost solution to the training‑inference inconsistency in LLM‑based recommender systems. By ensuring that each token of a positive item stays within the top‑B candidates during decoding, it prevents premature pruning and translates directly into higher recommendation accuracy, while preserving the scalability of standard supervised fine‑tuning. This work paves the way for more reliable deployment of LLMs in real‑world recommendation pipelines that rely on beam search for efficient inference.

