ChainRec: An Agentic Recommender Learning to Route Tool Chains for Diverse and Evolving Interests


Large language models (LLMs) are increasingly integrated into recommender systems, motivating recent interest in agentic and reasoning-based recommendation. However, most existing approaches still rely on fixed workflows, applying the same reasoning procedure across diverse recommendation scenarios. In practice, user contexts vary substantially; in cold-start settings or during interest shifts, for example, an agent should adaptively decide what evidence to gather next rather than follow a scripted process. To address this, we propose ChainRec, an agentic recommender that uses a planner to dynamically select reasoning tools. ChainRec builds a standardized Tool Agent Library from expert trajectories. It then trains a planner using supervised fine-tuning and preference optimization to dynamically select tools, decide their order, and determine when to stop. Experiments on AgentRecBench across Amazon, Yelp, and Goodreads show that ChainRec consistently improves Avg HR@{1,3,5} over strong baselines, with especially notable gains in cold-start and evolving-interest scenarios. Ablation studies further validate the importance of tool standardization and preference-optimized planning.


💡 Research Summary

ChainRec tackles a fundamental limitation of current large‑language‑model (LLM)‑driven recommender agents: the reliance on a fixed reasoning workflow that does not adapt to the diverse and evolving information needs of users. In many realistic scenarios—cold‑start users with sparse histories, sudden interest shifts, or items with limited metadata—the agent must decide what evidence to acquire, in what order, and when to stop before producing a final ranking. To address this, the authors propose a two‑layer architecture.

The first layer is a standardized Tool Agent Library (TAL). The authors collect expert Chain‑of‑Thought (CoT) traces from prior work and real‑world deployments, then cluster and canonicalize recurring reasoning steps (e.g., “summarize user preferences”, “fetch item reviews”, “extract attribute vectors”). Each tool is given a unified input‑output schema and a deterministic memory‑write format, turning heterogeneous evidence‑gathering operations into reusable, composable capabilities.
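A unified input‑output schema with a deterministic memory‑write slot is the key property that makes tools composable. The following is a minimal illustrative sketch of what such a standardized tool entry could look like; the names (`ToolSpec`, `invoke`, `memory_key`) and the example tool are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of one standardized entry in a Tool Agent Library.
@dataclass
class ToolSpec:
    name: str                    # canonical tool name, e.g. "summarize_user_prefs"
    input_schema: dict           # required input fields and their expected types
    output_schema: dict          # fields the tool promises to return
    memory_key: str              # deterministic slot in the shared episode memory
    run: Callable[[dict], dict]  # the underlying evidence-gathering operation

def invoke(tool: ToolSpec, args: dict, memory: dict) -> dict:
    """Validate arguments, run the tool, and write its output to memory."""
    missing = [k for k in tool.input_schema if k not in args]
    if missing:
        raise ValueError(f"{tool.name}: missing inputs {missing}")
    output = tool.run(args)
    memory[tool.memory_key] = output  # deterministic memory-write format
    return output

# Toy preference-summarization tool for illustration only.
summarize = ToolSpec(
    name="summarize_user_prefs",
    input_schema={"history": list},
    output_schema={"summary": str},
    memory_key="user_prefs",
    run=lambda a: {"summary": ", ".join(sorted(set(a["history"])))},
)

memory = {}
invoke(summarize, {"history": ["sci-fi", "thriller", "sci-fi"]}, memory)
```

Because every tool writes to a fixed memory slot with a declared schema, a planner can reason about which evidence is already present and which tool to call next without parsing free-form text.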

The second layer is a Planner that learns to route among these tools dynamically. Training proceeds in two stages. In Supervised Fine‑Tuning (SFT), the Planner imitates expert trajectories, learning to invoke the correct tool with the right arguments and to record outputs consistently. In Direct Preference Optimization (DPO), the model is further refined using offline pairwise preference data: for the same user‑item episode, multiple tool‑chain candidates are generated, and the higher‑performing chains (as measured by Hit‑Rate) are preferred. DPO optimizes a closed‑form pairwise loss, avoiding the need for a separate reward model or costly on‑policy rollouts.
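The closed-form pairwise objective mentioned above is the standard DPO loss: the policy is pushed to prefer the winning tool chain over the losing one by a larger margin than a frozen reference (SFT) model does. A minimal sketch for a single preference pair, with an illustrative `beta` value:

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Closed-form DPO pairwise loss for one preference pair.

    logp_w / logp_l: the policy's summed log-probabilities of the
    preferred (higher Hit-Rate) and dispreferred tool chains.
    ref_logp_w / ref_logp_l: the same quantities under the frozen
    SFT reference model. No reward model or rollout is needed.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the policy already prefers
    # the winning chain more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree exactly, the margin is zero and the loss equals log 2; it shrinks as the policy shifts probability toward the higher-performing chain.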

Formally, recommendation is cast as a finite‑horizon Markov Decision Process (MDP). The state at step t consists of the target user, the fixed candidate set, and the accumulated memory of past tool actions and their summarized outputs. The action space includes K predefined tools plus a terminal ranking action. Each tool may have preconditions (e.g., requires a user profile) and the planner is constrained by a maximum step budget T_max. The reward is a weighted sum of the final ranking quality (average HR@{1,3,5}) and a penalty proportional to the number of tool steps taken, controlled by a hyper‑parameter λ. The objective is to learn a policy that maximizes expected reward, i.e., high accuracy with minimal evidence‑gathering cost.
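The reward described above can be sketched as follows. The averaging over HR@{1,3,5} follows the paper's description; the exact penalty form and the value of λ are assumptions, since the paper only states that the penalty is proportional to the number of tool steps.

```python
def episode_reward(ranked: list, target, steps: int,
                   lam: float = 0.05, ks=(1, 3, 5)) -> float:
    """Terminal reward for one episode: average HR@k over k in
    {1, 3, 5}, minus a cost proportional to tool steps taken.

    ranked: the final ranked candidate list produced by the agent.
    target: the ground-truth item for this episode.
    steps:  number of tool calls before the terminal ranking action.
    lam:    step-cost weight (illustrative value).
    """
    hits = [1.0 if target in ranked[:k] else 0.0 for k in ks]
    avg_hr = sum(hits) / len(ks)
    return avg_hr - lam * steps
```

Under this objective, a chain that gathers one more piece of evidence is only worthwhile if it raises the expected average hit rate by more than λ, which is exactly the cost-aware stopping behavior the planner is trained to exhibit.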

Experiments are conducted on AgentRecBench, an interactive recommendation benchmark that simulates realistic user‑item interactions. Three domains are evaluated: Amazon (e‑commerce), Yelp (local services), and Goodreads (books). In each episode the agent receives only the user ID and a candidate set, then may call tools up to the budget before outputting a ranked list. Baselines include (i) fixed‑prompt CoT agents, (ii) traditional collaborative‑filtering and graph‑based sequential models, and (iii) recent agentic recommenders such as RecMind and Agent4Rec. ChainRec consistently outperforms all baselines on average HR@1, HR@3, and HR@5, with gains of 2–5 percentage points. The improvements are especially pronounced in cold‑start and interest‑shift scenarios, where the planner learns to prioritize user‑side evidence (e.g., preference summarization) when history is scarce, or to shift toward short‑term signals (e.g., recent search queries) when long‑term preferences conflict with current intent.

Ablation studies dissect the contributions of each component. Removing tool standardization (using raw API calls) degrades performance, confirming the value of a unified capability layer. Training with SFT alone, without DPO, yields a planner that imitates expert sequences but fails to preferentially select higher‑utility chains, resulting in lower HR. Tightening the step budget reduces the planner’s ability to gather sufficient evidence, again lowering performance. t‑SNE visualizations of CoT embeddings show distinct clusters corresponding to different scenarios, illustrating why a static chain is insufficient.

Overall, ChainRec demonstrates that separating “capability construction” from “policy learning” enables LLM agents to perform structured, cost‑aware evidence gathering and produce more accurate recommendations across heterogeneous domains. The paper suggests future extensions such as incorporating real‑time APIs, multi‑episode learning, online reinforcement learning with user feedback, and richer tool vocabularies to further enhance adaptability and robustness.

