Internalizing Multi-Agent Reasoning for Accurate and Efficient LLM-based Recommendation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large Language Models (LLMs) are reshaping recommender systems by leveraging extensive world knowledge and semantic reasoning to interpret user intent. However, effectively integrating these capabilities with collaborative signals while avoiding prohibitive inference latency remains a critical bottleneck. To address this, we propose a trajectory-driven internalization framework to develop a Single-agent Trajectory-Aligned Recommender (STAR). Specifically, to internalize complex reasoning capabilities into a single efficient model, we first design a multi-agent teacher system capable of multi-turn tool usage and reflection. This teacher utilizes a Collaborative Signal Translation mechanism to explicitly convert latent behavioral patterns into descriptive natural language evidence to enhance reasoning accuracy. Subsequently, a trajectory-driven distillation pipeline transfers this agentic logic, including planning, tool usage, and self-reflection, into the compact STAR model. Extensive experiments demonstrate that STAR surpasses its teacher by 8.7% to 39.5% while eliminating iterative latency, paving the way for real-time, reasoning-enhanced recommendation.


💡 Research Summary

The paper tackles the longstanding tension in recommender systems between the rich semantic reasoning capabilities of large language models (LLMs) and the proven effectiveness of collaborative‑filtering signals, while also addressing the latency problem that plagues multi‑turn tool‑augmented agents. The authors introduce a two‑phase framework. In the first phase they build a multi‑agent teacher system called MARS (Multi‑Agent Recommender System). MARS follows a Plan‑Execute‑Reflect paradigm: a Planner decomposes a user’s request into sub‑tasks, specialized Execution Agents retrieve and synthesize collaborative evidence, a Reflector checks consistency, and a Ranking Agent produces the final item list. A key novelty is the Collaborative Signal Translation mechanism. By constructing a global user‑item bipartite graph, the system runs two graph‑traversal tools—Item‑CF (Item → User → Item) and User‑CF (User → Item → User)—to extract high‑order co‑occurrence and user similarity information. Instead of returning raw IDs, the tools feed the retrieved neighbor sets to an LLM that generates concise natural‑language summaries (“users who read The Three‑Body Problem also like Dune and Foundation”). These summaries are pre‑computed and stored, so at inference time the agent can retrieve them instantly without costly graph queries.
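The two traversal tools are standard neighborhood-based collaborative-filtering walks over the bipartite graph. Below is a minimal Python sketch of the Item-CF direction (Item → User → Item), assuming the interaction log is a list of (user, item) pairs; the function names and toy data are illustrative, not taken from the paper.

```python
from collections import Counter, defaultdict

# Illustrative sketch of the Item-CF traversal (Item -> User -> Item)
# on a user-item bipartite graph. Assumes a (user, item) interaction log.

def build_bipartite(interactions):
    """Index the interaction log from both sides of the bipartite graph."""
    user_to_items, item_to_users = defaultdict(set), defaultdict(set)
    for user, item in interactions:
        user_to_items[user].add(item)
        item_to_users[item].add(user)
    return user_to_items, item_to_users

def item_cf_neighbors(item, user_to_items, item_to_users, k=5):
    """Item -> User -> Item: items most often co-consumed with `item`,
    ranked by co-occurrence count over the two-hop neighborhood."""
    counts = Counter()
    for user in item_to_users[item]:
        for other in user_to_items[user]:
            if other != item:
                counts[other] += 1
    return [neighbor for neighbor, _ in counts.most_common(k)]

interactions = [
    ("u1", "The Three-Body Problem"), ("u1", "Dune"), ("u1", "Foundation"),
    ("u2", "The Three-Body Problem"), ("u2", "Dune"),
    ("u3", "The Three-Body Problem"), ("u3", "Foundation"),
]
user_to_items, item_to_users = build_bipartite(interactions)
neighbors = item_cf_neighbors("The Three-Body Problem", user_to_items, item_to_users)
```

In the paper's pipeline, a neighbor set like this is what the translation step hands to an LLM for verbalization ("users who read The Three-Body Problem also like Dune and Foundation"), with the summaries precomputed and cached for instant retrieval.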

MARS, however, is computationally heavy because it requires multiple turns and tool calls. To obtain a production‑ready model, the second phase distills MARS into a single‑agent model named STAR (Single‑agent Trajectory‑Aligned Recommender). The authors serialize the entire multi‑agent interaction log into a linear chain‑of‑thought format, explicitly marking planning, tool calls, reflections, and recommendations with dedicated special tokens (e.g., `<tool_call>`). Only trajectories where the teacher’s top‑1 prediction matches the ground truth are kept, ensuring high‑quality supervision. Stage 1 applies Supervised Fine‑Tuning (SFT) on these trajectories, teaching the student to mimic the planner’s decomposition, the syntax for invoking tools, and the structured output format. Stage 2 employs Group Relative Policy Optimization (GRPO), a reinforcement‑learning method that samples a group of candidate outputs per input, then optimizes a composite reward comprising format adherence and prediction accuracy. This encourages the student not merely to imitate but to learn when to trigger the collaborative‑signal tools and how to perform self‑reflection, effectively internalizing the teacher’s decision‑making logic.
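The composite reward described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the `<tool_call>` token appears in the summary, but the answer marker, regexes, and reward weights below are invented for the example, and the group-relative advantage is the standard GRPO normalization (reward minus group mean, scaled by group standard deviation).

```python
import re

# Hypothetical sketch of a GRPO-style composite reward: format adherence
# plus top-1 prediction accuracy. Token/marker names and weights are
# assumptions for illustration only.

def format_reward(output: str) -> float:
    """1.0 if the trajectory has a well-formed tool call and a final
    recommendation line, else 0.0."""
    has_tool = "<tool_call>" in output and "</tool_call>" in output
    has_answer = bool(re.search(r"Recommendation:\s*\S+", output))
    return 1.0 if (has_tool and has_answer) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the top-1 recommended item matches the ground truth."""
    m = re.search(r"Recommendation:\s*(.+)", output)
    return 1.0 if m and m.group(1).strip() == ground_truth else 0.0

def composite_reward(output: str, ground_truth: str,
                     w_fmt: float = 0.2, w_acc: float = 0.8) -> float:
    return w_fmt * format_reward(output) + w_acc * accuracy_reward(output, ground_truth)

def group_advantages(rewards):
    """GRPO baselines each sampled output against its own group: advantage
    = (reward - group mean) / group std, with no learned value model."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std or 1.0) for r in rewards]

# A group of three sampled trajectories for the same input:
good = "<tool_call>item_cf</tool_call>\nRecommendation: Dune"         # correct, well-formed
untagged = "Recommendation: Dune"                                      # correct, no tool call
wrong = "<tool_call>item_cf</tool_call>\nRecommendation: Foundation"   # well-formed, wrong item

rewards = [composite_reward(o, "Dune") for o in (good, untagged, wrong)]
advantages = group_advantages(rewards)
```

Here the correct, well-formed trajectory earns the highest reward and therefore a positive group-relative advantage, which is how the student is pushed toward both the tool-calling syntax and the right final item.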

Extensive experiments across several domains (books, movies, e‑commerce) show that STAR consistently outperforms its teacher by 8.7%–39.5% on ranking metrics such as HR@10 and NDCG@20, while reducing inference latency by a factor of five or more. Because the natural‑language evidence is pre‑computed, STAR’s runtime memory and storage footprints are modest, making it suitable for real‑time serving on commodity hardware. The paper’s contributions are threefold: (1) a novel Collaborative Signal Translation mechanism that bridges graph‑based collaborative data with LLM reasoning; (2) a trajectory‑driven distillation pipeline that combines SFT and GRPO to embed multi‑agent planning, tool usage, and reflection into a single model; and (3) the STAR model itself, which achieves superior accuracy and efficiency, demonstrating that sophisticated multi‑agent reasoning can be internalized without sacrificing real‑time performance. This work opens a new direction for LLM‑enhanced recommender systems, showing that explicit reasoning over collaborative signals can be both accurate and scalable.

