Are Large Language Models Really Effective for Training-Free Cold-Start Recommendation?
Recommender systems usually rely on large-scale interaction data to learn from users’ past behaviors and make accurate predictions. However, real-world applications often face situations where no training data is available, such as when launching new services or handling entirely new users. In such cases, conventional approaches cannot be applied. This study focuses on training-free recommendation, where no task-specific training is performed, and particularly on *training-free cold-start recommendation* (TFCSR), the more challenging case where the target user has no interactions. Large language models (LLMs) have recently been explored as a promising solution, and numerous approaches have been proposed. As the capabilities of text embedding models (TEMs) improve, they are increasingly recognized as applicable to training-free recommendation, but no prior work has directly compared LLMs and TEMs under identical conditions. We present the first controlled experiments that systematically evaluate these two approaches in the same setting. The results show that TEMs outperform LLM rerankers, and this trend holds not only in cold-start settings but also in warm-start settings with rich interactions. These findings indicate that direct LLM ranking is not the only viable option, contrary to the commonly shared belief, and TEM-based approaches provide a stronger and more scalable basis for training-free recommendation.
💡 Research Summary
This paper presents a pioneering and systematic comparison between Large Language Models (LLMs) and Text Embedding Models (TEMs) for a highly challenging yet practical scenario in recommender systems: Training-Free Cold-Start Recommendation (TFCSR). TFCSR refers to the task of making recommendations when no task-specific training data is available and the target user has zero (narrow cold-start) or very few (broad cold-start) past interactions, relying solely on textual profiles or item descriptions.
The authors begin by framing the research landscape through a two-axis classification: whether a model is trained, and the number of interactions the target user has. They identify TFCSR as an under-explored intersection where both constraints apply simultaneously. While LLMs have been actively explored as zero-shot rankers for training-free recommendation, and TEMs have gained recognition for their capabilities, no prior work had directly compared these two paradigms under identical, controlled conditions for TFCSR.
The core of the study is a rigorous experimental setup designed for a fair comparison. Using three public datasets (MovieLens-1M, Job recommendation, Amazon Review) that contain natural language metadata, the authors formulate a standard ranking task. For a given user, the input is either their profile text (narrow CS, m=0) or the text of a few historically interacted items (broad CS, m=1). The goal is to rank a fixed candidate set of 50 items, which includes 3 held-out positive items and 47 randomly sampled negatives. This controlled setup eliminates confounding factors like varying candidate set sizes.
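The candidate-set construction described above (3 held-out positives plus 47 random negatives per user) can be sketched as follows; the function name `build_candidate_set` and its arguments are illustrative, not taken from the paper:

```python
import random

def build_candidate_set(positives, item_pool, n_candidates=50, seed=0):
    """Build a fixed-size candidate set from held-out positives plus
    randomly sampled negatives, as in the paper's setup (3 + 47 = 50)."""
    rng = random.Random(seed)
    # Sample negatives from the pool, excluding the held-out positives.
    negatives = rng.sample(
        [i for i in item_pool if i not in set(positives)],
        n_candidates - len(positives),
    )
    candidates = positives + negatives
    rng.shuffle(candidates)  # avoid positional bias in the candidate list
    return candidates
```

Fixing the candidate-set size for every user is what removes candidate-pool size as a confounder when comparing methods.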
The evaluated methods include:
- Sparse Retrieval (BM25): A traditional unsupervised baseline.
- Dense Retrieval (TEMs): Various embedding models such as `multilingual-e5-large`, `bge-m3`, the `gte-Qwen2` series, and the `Qwen3-Embedding` series. User and item texts are encoded into vectors, and ranking is based on cosine similarity.
- LLM as a Reranker: LLMs such as `gpt-4.1-mini`, `gpt-4.1`, and `Qwen3-8B` are provided with a structured prompt containing user information and the candidate set, and are instructed to output an ordered list of top-10 recommendations.
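The dense-retrieval scoring step is simple enough to sketch in a few lines, assuming user and item embeddings have already been produced by the same TEM (the function name and NumPy-based implementation are illustrative):

```python
import numpy as np

def rank_by_cosine(user_vec, item_vecs):
    """Rank candidate items by cosine similarity to the user embedding.

    user_vec: (d,) embedding of the user profile (or interacted-item) text.
    item_vecs: (n, d) embeddings of the n candidate item texts.
    Returns candidate indices sorted from most to least similar.
    """
    u = user_vec / np.linalg.norm(user_vec)
    v = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    scores = v @ u  # cosine similarity of each candidate to the user
    return np.argsort(-scores)
```

Because scoring is a single matrix-vector product over precomputed embeddings, this approach scales to large candidate pools far more cheaply than prompting an LLM with every candidate.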
The key findings are striking and consistent across datasets:
- Superiority of TEMs: Modern TEMs, particularly those trained with LLM supervision such as `gte-Qwen2-7B-instruct` and `Qwen3-Embedding-8B`, significantly outperformed LLM rerankers in both narrow and broad cold-start settings. They also consistently beat the BM25 baseline.
- Underperformance of LLM Rerankers: Even powerful LLMs like `gpt-4.1` often failed to outperform BM25 in terms of nDCG, a metric sensitive to ranking quality. Their performance was notably weaker than that of the best TEMs. Interestingly, `Qwen3-8B` (LLM) performed far worse than its embedding counterpart `Qwen3-Embedding-8B`, suggesting that the LLM’s knowledge is more effectively utilized for creating semantic spaces than for direct ranking.
- The Importance of LLM-Supervised Training for TEMs: Not all TEMs performed well. Older models without advanced training (e.g., `gte-modernbert-base`) sometimes underperformed BM25, highlighting that the strength of the top-performing TEMs stems from their training methodology using synthetic data from LLMs.
- Generalizability to Warm-Start: The trend of TEMs outperforming LLM rerankers was also observed in experiments with users having richer interaction histories, indicating the finding’s robustness beyond the strict cold-start scenario.
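Since nDCG is central to the comparison above, here is a minimal binary-relevance nDCG@k implementation for reference; this is the standard textbook formulation, not necessarily the paper's exact evaluation code:

```python
import math

def ndcg_at_k(ranked_ids, positive_ids, k=10):
    """Binary-relevance nDCG@k: discounted cumulative gain of the top-k
    ranking, normalized by the ideal DCG (all positives ranked first)."""
    pos = set(positive_ids)
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so discount is log2(rank+2)
        for rank, item in enumerate(ranked_ids[:k])
        if item in pos
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(pos), k)))
    return dcg / ideal if ideal > 0 else 0.0
```

The logarithmic discount is why nDCG is sensitive to *where* the positives land in the ranking, not just whether they appear in the top-k, which is the sense in which the summary calls it "a metric sensitive to ranking quality".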
The study includes an error analysis using user-level win/loss comparisons, which shows that TEMs achieve higher scores for a majority of users, not just a specific subset.
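The user-level win/loss comparison amounts to a per-user tally of which method scores higher on a chosen metric; a minimal sketch, with a hypothetical dict-based interface mapping user IDs to metric values (e.g., per-user nDCG):

```python
def win_loss_tie(scores_a, scores_b):
    """Count per-user wins, losses, and ties for method A versus method B.

    scores_a, scores_b: dicts mapping user ID -> metric value for the
    same set of users. A 'win' for A means a strictly higher score.
    """
    wins = losses = ties = 0
    for user in scores_a:
        a, b = scores_a[user], scores_b[user]
        if a > b:
            wins += 1
        elif a < b:
            losses += 1
        else:
            ties += 1
    return wins, losses, ties
```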
In conclusion, this work challenges the prevailing assumption that direct LLM ranking is the only or most effective solution for training-free recommendation. It provides strong empirical evidence that TEM-based approaches, especially those leveraging LLM capabilities for embedding training, offer a more accurate, scalable (capable of handling large candidate pools), and potentially more cost-effective foundation for TFCSR. The research fills a critical gap in the literature and offers clear guidance for future work in practical, data-scarce recommendation scenarios.