Exploring Diversity, Novelty, and Popularity Bias in ChatGPT's Recommendations
📝 Original Paper Info
- Title: Exploring Diversity, Novelty, and Popularity Bias in ChatGPT's Recommendations
- ArXiv ID: 2601.01997
- Date: 2026-01-05
- Authors: Dario Di Palma, Giovanni Maria Biancofiore, Vito Walter Anelli, Fedelucio Narducci, Tommaso Di Noia
📝 Abstract
ChatGPT has emerged as a versatile tool, demonstrating capabilities across diverse domains. Given these successes, the Recommender Systems (RSs) community has begun investigating its applications within recommendation scenarios primarily focusing on accuracy. While the integration of ChatGPT into RSs has garnered significant attention, a comprehensive analysis of its performance across various dimensions remains largely unexplored. Specifically, the capabilities of providing diverse and novel recommendations or exploring potential biases such as popularity bias have not been thoroughly examined. As the use of these models continues to expand, understanding these aspects is crucial for enhancing user satisfaction and achieving long-term personalization. This study investigates the recommendations provided by ChatGPT-3.5 and ChatGPT-4 by assessing ChatGPT's capabilities in terms of diversity, novelty, and popularity bias. We evaluate these models on three distinct datasets and assess their performance in Top-N recommendation and cold-start scenarios. The findings reveal that ChatGPT-4 matches or surpasses traditional recommenders, demonstrating the ability to balance novelty and diversity in recommendations. Furthermore, in the cold-start scenario, ChatGPT models exhibit superior performance in both accuracy and novelty, suggesting they can be particularly beneficial for new users. This research highlights the strengths and limitations of ChatGPT's recommendations, offering new perspectives on the capacity of these models to provide recommendations beyond accuracy-focused metrics.
💡 Summary & Analysis
1. **Principle Explanation**:
- **Basic**: Recommender systems predict user preferences and deliver personalized content to help users find valuable information. ChatGPT can aid in generating better recommendations.
- **Intermediate**: While traditional recommender systems focus on accuracy, ChatGPT's recommendations can also be examined along dimensions such as diversity, novelty, and popularity bias, offering users a wider range of experiences.
- **Advanced**: As a Large Language Model (LLM), ChatGPT is proficient at text generation. Integrating it with recommender systems enables broader item coverage and more nuanced explanations.
2. **Research Methodology**:
- Basic: The research team evaluated the diversity, novelty, and popularity bias of recommendations using ChatGPT across various datasets.
- Intermediate: They developed a methodology utilizing role-playing prompts which reduces duplicate recommendations and provides better results.
- Advanced: Prompt designs were inspired by GPT-3 success cases and various prompt techniques were used to optimize model performance.
3. **Significance of Research**:
- Basic: This research helps understand how ChatGPT can be integrated into recommendation systems.
- Intermediate: Evaluations using diverse datasets demonstrate that ChatGPT could provide better user experiences.
- Advanced: The study analyzes how ChatGPT operates beyond accuracy in terms of diversity, novelty, and popularity bias across different dimensions.
📄 Full Paper Content (ArXiv Source)
Keywords: ChatGPT, Recommender Systems (RSs), Large Language Models (LLMs), Diversity, Novelty, Popularity Bias, Cold-Start
Introduction
Recommender systems (RSs) have long assisted users in discovering valuable information on the web by predicting their preferences and delivering personalized content. Over time, these systems have evolved from Matrix Factorization approaches to modern architectures that extend state-of-the-art Deep Learning models, originally developed for other domains such as time-series forecasting, natural language processing, and computer vision. Despite significant progress in improving accuracy, current research in the user modeling and personalization community increasingly emphasizes the importance of beyond-accuracy perspectives such as diversity, novelty, and popularity bias. These factors not only impact overall system effectiveness but also influence user satisfaction, long-term engagement, and fairness.
With the release of ChatGPT in November 2022, Large Language Models (LLMs) have begun to reshape how recommendations can be delivered. Unlike traditional RSs that rely on carefully structured training data, LLMs can generate free-form text, potentially offering more nuanced explanations and broader item coverage by leveraging their vast knowledge. Consequently, the research community is now experimenting with LLM-driven recommendation pipelines, demonstrating notable successes in improving recommendation accuracy. However, most existing studies on ChatGPT-based recommender systems have emphasized improving accuracy while neglecting the beyond-accuracy dimensions that are critical for real-world impact.
Ignoring the beyond-accuracy behavior of ChatGPT creates a black box for researchers, making it difficult to determine whether it over-recommends popular items, reduces novelty, or offers less diverse recommendations, all of which may negatively impact user satisfaction and long-term personalization goals. Early investigations focused on how to use ChatGPT for re-ranking recommendations, while others began to study the serendipity of the generated recommendations or explored how ChatGPT generates recommendations and whether its outputs align more closely with content-based or collaborative filtering approaches. Only one study investigates biases in ChatGPT-based recommender systems, with a specific focus on provider fairness. While a few works have begun examining potential biases related to sensitive attributes such as race, gender, and religion, aspects like recommendation diversity, novelty, and popularity bias in ChatGPT remain largely unexplored. Addressing these gaps is essential to ensure that personalization technologies are both effective and fair.
To this end, we analyze ChatGPT’s recommendation behavior, focusing on both ChatGPT-3.5 and ChatGPT-4 across multiple beyond-accuracy metrics. Specifically, we investigate whether ChatGPT generates diverse and novel recommendations or exhibits popularity bias, both under normal conditions and in user cold-start scenarios where users have interacted with only a few items. Our evaluation spans three distinct domains, Books, Movies, and Music, using the Facebook Books, MovieLens, and Last.FM datasets as benchmarks, aiming to answer the following Research Questions (RQs):
- RQ1: Are ChatGPT’s recommendations diverse?
- RQ2: Are ChatGPT’s recommendations novel?
- RQ3: Is ChatGPT affected by popularity bias?
- RQ4: How effective is ChatGPT in the user cold-start scenario across accuracy and beyond-accuracy dimensions?
Related Work
Diversity, Novelty, and Popularity Bias in Recommender Systems. Driven by the need for Recommender Systems (RSs) to enhance user engagement, this work focuses on beyond-accuracy measures of RSs, namely, diversity, novelty, and popularity bias, to investigate how these factors affect the recommendation lists provided by ChatGPT.
There was a moment in the evolution of RSs when researchers realized that evaluating recommendations solely based on accuracy metrics was insufficient. For instance, some authors suggest that the performance of recommendations should be measured by their usefulness to the user. Similarly, a survey on the evaluation of RSs suggests that recommendations can be evaluated based on utility, novelty, diversity, unexpectedness, serendipity, and coverage. Karimi, Jannach, and Jugovac, in their review of state-of-the-art news RSs, identify diversity, novelty, and popularity as the most common quality factors for improving recommendations. Specifically, diversity and novelty are often considered quality factors that must be balanced with prediction accuracy, and they are among the most discussed beyond-accuracy objectives in recommender system research.
As interest in studying RSs beyond accuracy metrics spread, more studies began to use these metrics as goals for improvement. For example, several works focused on creating RSs that not only predict accurate items but also achieve a high level of diversity in recommendations. Others employed a graph-based approach to identify items with higher novelty, or proposed methods to mitigate popularity bias. Further work emphasizes the importance of evaluating RSs beyond accuracy, proposing a multi-objective evaluation approach.
Although many works use beyond-accuracy metrics to evaluate and improve RSs, prior literature lacks a unified framework that rigorously defines diversity, novelty, and bias, leading to vagueness and overlap among these measures. In our study, we define these concepts as follows. Diversity is the extent to which a recommender system suggests a wide range of items from the catalog¹. Novelty is the degree to which recommended items expose users to relevant experiences they are unlikely to discover independently. Popularity Bias refers to the tendency of recommender systems to favor popular items, those with many interactions, over less popular or niche items.
ChatGPT-based recommendation. A first example of ChatGPT for recommendation is Chat-REC, a ChatGPT-augmented recommender system that translates the recommendation task into an interactive conversation with users. The authors proposed a prompt template to convert user information and user-item interactions into a query for ChatGPT. However, the system was evaluated solely using accuracy metrics (i.e., Recall, Precision, nDCG). Another study investigates ChatGPT’s performance in a multi-turn conversational recommendation setting, demonstrating its potential as a conversational recommender and showing that it outperforms traditional methods. Other authors focused on ChatGPT in zero-shot settings, analyzing ChatGPT models with a dedicated prompting template and revealing that ChatGPT-4 achieved the highest ranking performance compared to other LLMs in the zero-shot recommendation task.
A further study investigated the abilities of ChatGPT as a recommender system for the Top-N recommendation task, aiming to identify the most effective prompting strategy for producing relevant recommendations. The authors concluded that the zero-shot setting yields the most relevant recommendation list, outperforming content-based baselines. However, their conclusions were based solely on nDCG as the evaluation metric, which limits the findings to only one dimension of RSs.
Other authors investigate ChatGPT’s abilities in suggesting items through rating prediction, pairwise recommendation, and re-ranking strategies using prompting. Their experiments, conducted on four domains, demonstrate ChatGPT’s ability to recommend items. Nonetheless, this study provides only an accuracy-oriented view of ChatGPT’s capabilities in Top-N recommendation.
Finally, another work focuses on applying ChatGPT within the book recommendation scenario, designing BookGPT to address single-item and rating prediction tasks. However, the study does not provide a generalizable analysis of ChatGPT’s performance across multiple domains, as the authors focus only on the book domain.
Although all the presented works focus on using ChatGPT to improve the performance of recommender systems, they are primarily based on accuracy metrics. To address this gap, our work investigates the task of Top-N recommendation, moving beyond accuracy by evaluating ChatGPT’s performance in terms of diversity, novelty, and popularity bias, while also highlighting its beyond-accuracy capabilities in user cold-start scenarios.
Methodology
The following sections discuss the methodology used in our research, outline the design of the prompts employed to collect recommendations from ChatGPT, detail the datasets used in the experiments, present the baselines for comparison, and list the metrics used to assess diversity, novelty, and popularity bias.
Prompt Design
The introduction of GPT-3 demonstrated the ability of LLMs to perform diverse tasks when provided with clear, task-specific prompts, showing how prompts condition the model’s response and play a critical role in shaping its performance on a given task.
With the widespread diffusion of ChatGPT, the literature on prompt engineering has expanded, moving from basic prompts such as zero- and few-shot to more complex prompts like Chain-of-Thought, Tree-of-Thoughts, Reflexion, or Graph-Prompting. Among the various prompt techniques, we hand-engineered Zero-Shot, Few-Shot, Chain-of-Thought, and Role-Playing (RP) prompting, following prior work on prompting strategies, to identify the best approach for our investigation.
In the following, we present the hand-engineered prompts and explain the main reasons for selecting RP prompting as the primary technique for our investigation. Specifically, for all the tested prompts and for each user, the input consists of the user’s history, presented as a list of items formatted as follows: $`\{History\ of\ the\ User\}: Item_1, Item_2, \ldots, Item_{N}`$.
Zero-shot prompting. In zero-shot prompting, we directly provided the user’s history to ChatGPT and asked for 50 recommendations, as shown in the reference example (see Fig. 1). However, $`\sim`$71% of the generated lists contained fewer than 50 items or included repeated entries, and $`\sim`$6% exhibited incorrect task execution.
Few-shot prompting. After zero-shot prompting, we tested few-shot prompting by providing a few demonstrations of recommendations to help the LLM better understand the task (see Fig. 2). While these contextual examples reduced execution errors, $`\sim`$44% of the generated lists contained duplicate items.
Chain-of-Thought (CoT) prompting. Using CoT, we attempted to break the recommendation task into explicit steps to force ChatGPT to reason step-by-step. As shown in Fig. 3, we explicitly defined the instructions, the user’s preferences, and the steps to identify the most suitable recommendations. This approach produced excessive tokens, reaching the context limit after generating $`\sim`$26 items.
Role-Playing prompting. Following prior work on role-play prompting, we also tested Role-Playing prompts, where ChatGPT impersonates a Recommender System and recommends items based on the user’s history (see Fig. 4). This strategy proved the most effective, eliminating duplicate recommendations.
After testing 30 hand-crafted prompts and aligning with studies on Role-Playing Prompting, we selected this approach for its ability to reduce duplicates and token usage. In this setup, ChatGPT acts as a Recommender System, generating 50 recommendations based on the user’s history (see Fig. 4).
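To make the selected strategy concrete, the snippet below is a minimal, illustrative sketch of how a role-playing prompt for 50 recommendations could be assembled and sent through the OpenAI Python client. The exact wording of the paper's prompt (Fig. 4), the helper names, and the model identifier are assumptions for illustration, not the authors' artifacts.

```python
# Minimal sketch of a role-playing prompt for Top-N recommendation.
# The exact wording of the paper's Fig. 4 prompt is not reproduced here;
# names, phrasing, and the model identifier are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def role_playing_prompt(user_history, n=50):
    items = ", ".join(user_history)
    return (
        "You are a Recommender System. Based on the user's history, "
        f"recommend exactly {n} items the user has not interacted with, "
        "as a numbered list with no duplicates and no explanations.\n"
        f"{{History of the User}}: {items}"
    )

def get_recommendations(user_history, model="gpt-4"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": role_playing_prompt(user_history)}],
    )
    return response.choices[0].message.content

# Example usage with a short movie history
print(get_recommendations(["Toy Story", "Heat", "GoldenEye"]))
```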
Experimental Setup
This section outlines the experimental setup, including the datasets, baselines, and metrics used to assess the beyond-accuracy performance of ChatGPT’s recommendations, with a focus on diversity, novelty, and popularity bias.
Datasets. We evaluated ChatGPT on three well-known recommendation datasets, namely MovieLens100k, Last.FM, and Facebook Books². To enhance data quality, we applied an iterative 10-core filtering strategy, retaining only users and items with at least ten interactions (a minimal sketch of this filtering appears after Table 1). Table 1 reports the dataset statistics after preprocessing.
| Dataset | Interaction | Users | Items | Sparsity | Content |
|---|---|---|---|---|---|
| MovieLens | 42,456 | 603 | 1,862 | 96.22% | genre |
| Last.FM | 49,171 | 1,797 | 1,507 | 98.18% | genre |
| FB Books | 13,117 | 1,398 | 2,234 | 99.58% | genre, author |
Table 1: Dataset statistics after pre-processing.
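As a rough illustration of the preprocessing step described above, the following sketch applies iterative 10-core filtering to a table of (user, item) interactions; it is a simplified stand-in, not the actual pipeline used by the authors.

```python
# Illustrative iterative k-core filtering (k = 10): repeatedly drop users and
# items with fewer than k interactions until the dataset no longer changes.
# A sketch only, not the framework's actual preprocessing code.
import pandas as pd

def k_core_filter(df: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    """df has columns 'user' and 'item', one row per interaction."""
    while True:
        user_counts = df["user"].value_counts()
        item_counts = df["item"].value_counts()
        keep_users = user_counts[user_counts >= k].index
        keep_items = item_counts[item_counts >= k].index
        filtered = df[df["user"].isin(keep_users) & df["item"].isin(keep_items)]
        if len(filtered) == len(df):  # converged: nothing removed in this pass
            return filtered
        df = filtered
```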
Baseline Models. To measure the effectiveness of ChatGPT, we
experimentally compare its performance with state-of-the-art baselines
from three categories: Non-Personalized, Collaborative Filtering, and
Content-Based Filtering methods. To ensure a fair comparison, we train
the baselines and optimize their hyperparameters using the Elliot
framework , and split the dataset into 80% training and 20% test sets,
following the all unrated items evaluation protocol . The code used for
the experiments is publicly available at:
https://github.com/sisinflab/beyond-accuracy-recsys-chatgpt
. Below, we
describe the baselines, grouped by recommendation category.
Non-Personalized. Random and Most Popular return random recommendations and the most popular recommendations, respectively, and are used as reference points.

Collaborative Filtering. To compare the effect of ChatGPT recommendations on beyond-accuracy metrics, we selected the following collaborative filtering methods, each focusing on different aspects. Specifically, we selected $`\textrm{RP}^3_{\beta}`$ and LightGCN for their demonstrated ability to maintain accuracy while preserving diversity. ItemKNN, UserKNN, and EASE$`^R`$ were chosen for their emphasis on relevance and personalization. Finally, MF2020 and NeuMF were included as a tradeoff between model complexity and effectiveness.

Content-Based Filtering. We further extend our comparison by including content-based models, which prioritize explicit feature representations and offer a meaningful contrast to collaborative models. This allows us to evaluate ChatGPT against the most appropriate model type for the dataset. Specifically, we include VSM, which represents items as vectors in a high-dimensional space, with each dimension corresponding to a feature, as well as AttributeItemKNN and AttributeUserKNN, which rely on TF-IDF-weighted attribute vectors to compute similarities and generate recommendations.
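As a rough picture of the content-based ingredients mentioned above, the sketch below builds TF-IDF vectors from item attributes and derives an item-item cosine-similarity matrix; the toy catalog and the code are illustrative assumptions, not the Elliot implementations of AttributeItemKNN or AttributeUserKNN.

```python
# Rough illustration of the content-based ingredients behind AttributeItemKNN:
# items are represented by TF-IDF-weighted attribute vectors (e.g., genres),
# and similarities come from cosine similarity. Toy data, not the real catalog.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

item_attributes = {              # hypothetical toy catalog
    "Heat": "crime thriller",
    "GoldenEye": "action thriller",
    "Toy Story": "animation comedy family",
}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(item_attributes.values())
similarity = cosine_similarity(matrix)   # item-item similarity matrix

items = list(item_attributes)
print(f"sim(Heat, GoldenEye) = {similarity[items.index('Heat'), items.index('GoldenEye')]:.2f}")
```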
Ensuring Recommendation Consistency. ChatGPT models generate
recommendations based solely on the user profile provided in the prompt,
without being constrained to a predefined dataset. As a result, they may
hallucinate or suggest real items not present in the reference dataset,
leading to discrepancies in item names and inconsistencies in
evaluation.
To address this, we adopt a post-processing pipeline that uses Gestalt pattern matching to identify the closest match in the dataset, accepting items with a similarity score above 90% (empirically determined). Unmatched items are flagged as External Items, originating from the LLM’s pre-trained knowledge, and excluded from evaluation to ensure a fair comparison with traditional recommenders by selecting in-catalogue items.
Since this final step could affect our evaluations, we verified that out-of-catalogue items consistently appeared beyond the top-10 positions in all recommendation lists, ensuring that rank-sensitive metrics remain unaffected and preserving the validity of our evaluation. In our configuration, ChatGPT placed these items only after rank 23, suggesting 2,740 out-of-catalogue items for Books, 870 for Music, and 234 for Movies.
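For reference, Python's difflib.SequenceMatcher implements the Ratcliff/Obershelp (Gestalt) pattern-matching algorithm, so the matching step could look roughly like the sketch below; the helper name and example titles are ours, and the 0.9 threshold mirrors the empirically chosen 90% similarity cutoff.

```python
# Sketch of the title-matching step: difflib.SequenceMatcher implements the
# Ratcliff/Obershelp (Gestalt) pattern-matching algorithm. Titles whose best
# match falls below the 0.9 threshold would be flagged as External Items.
from difflib import SequenceMatcher

def match_to_catalogue(generated_title, catalogue, threshold=0.9):
    best_item, best_score = None, 0.0
    for item in catalogue:
        score = SequenceMatcher(None, generated_title.lower(), item.lower()).ratio()
        if score > best_score:
            best_item, best_score = item, score
    return best_item if best_score >= threshold else None  # None -> External Item

catalogue = ["The Lord of the Rings", "The Hobbit"]
print(match_to_catalogue("The Lord Of The Rings", catalogue))
```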
Finally, to ensure a fair comparison, we evaluate all models and ChatGPT
results at a cutoff of 10 (i.e., Top-10 recommendations per user),
following widely accepted practices in recommendation.
Evaluation Metrics. While our primary focus is on the
beyond-accuracy aspects of ChatGPT’s recommendations, it is also
important to include accuracy metrics to assess whether the
recommendations are relevant to users. For this purpose, we use two
standard metrics: Precision and Recall. Higher values of Precision and
Recall indicate that the recommender system provides a greater number of
relevant items. Additionally, we evaluate the ranking quality of the
recommendations using the Normalized Discounted Cumulative Gain (nDCG),
where higher values indicate better recommendation lists.
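For clarity, a minimal sketch of these accuracy metrics at a fixed cutoff, assuming binary relevance (an item is relevant iff it appears in the user's test set), is shown below; this is a simplified illustration rather than the evaluation framework's implementation.

```python
# Minimal sketch of accuracy metrics at a cutoff k for a single user,
# assuming binary relevance (an item is relevant iff it is in the test set).
import math

def precision_recall_ndcg(recommended, relevant, k=10):
    top_k = recommended[:k]
    hits = [1 if item in relevant else 0 for item in top_k]
    precision = sum(hits) / k
    recall = sum(hits) / len(relevant) if relevant else 0.0
    dcg = sum(h / math.log2(rank + 2) for rank, h in enumerate(hits))
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(rank + 2) for rank in range(ideal_hits))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg

# Example usage with a toy recommendation list and test set
print(precision_recall_ndcg(["a", "b", "c"], {"b", "d"}, k=3))
```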
For beyond-accuracy metrics, we selected a set of measures to evaluate diversity, novelty, and popularity bias. The specific metrics considered are detailed in Table [tab:beyond_accuracy_table].
Experimental Results
ChatGPT Beyond-Accuracy Recommendation Performance
In this section, we discuss the empirical findings from
Table [tab:combined_all], focusing on
(RQ1.) the diversity of ChatGPT’s recommendations, (RQ2.) their novelty,
and (RQ3.) the extent to which ChatGPT is affected by popularity bias.
The evaluation comprises three datasets, Facebook Books, Last.FM, and
MovieLens, and compares ChatGPT‑3.5 and ChatGPT‑4 against both
Collaborative Filtering and Content‑Based Filtering baselines.
Statistically significant differences (paired t‑tests at $`p<0.05`$) are
noted where indicated in the table.
Preliminary Accuracy Analysis. Before examining diversity, novelty,
and popularity bias, we first verify that ChatGPT’s recommendations
fulfill the primary goal of offering relevant items. We use nDCG,
Recall, and Precision as standard accuracy metrics. Higher values on
these metrics imply better recommendations.
Overall, ChatGPT demonstrates a comparable level of accuracy in recommendation scenarios. Specifically, on Facebook Books, ChatGPT-4 attains the highest nDCG overall (0.0932), significantly outperforming the best baseline, AttributeItemKNN (0.0479), as well as ChatGPT-3.5 (0.0668). Recall and Precision follow a similar pattern to nDCG.
For Last.FM, while ChatGPT-4 (nDCG = 0.2832) does not surpass the best Collaborative Filtering (CF) approach ($`\textrm{RP}^3_{\beta}`$: 0.3147), it still ranks among the top-performing algorithms. ChatGPT-3.5 trails behind ChatGPT-4 but still outperforms some baselines (e.g., EASE$`^R`$, AttributeItemKNN).
For MovieLens, although ChatGPT-4 improves upon ChatGPT-3.5 across all accuracy metrics, raising nDCG from 0.1475 to 0.1815 and Precision from 0.1120 to 0.1551, certain CF algorithms (e.g., $`\textrm{RP}^3_{\beta}`$: 0.2827 nDCG, 0.2708 Precision) achieve significantly higher scores. Nonetheless, ChatGPT’s accuracy levels comfortably exceed those of some methods, such as VSM (0.0174 nDCG) and AttributeItemKNN (0.0326 nDCG).
These results demonstrate that both ChatGPT-3.5 and ChatGPT-4 achieve
valid and reasonable performance on accuracy metrics. This preliminary
evaluation ensures that the subsequent analysis of diversity, novelty,
and popularity bias is based on recommendations that already meet the
accuracy standard. In the following sections, our analysis is divided
according to the research questions (RQs).
(RQ1.) Are ChatGPT’s recommendations diverse? We assess diversity
using Gini and Item Coverage (ItemCV). A lower Gini indicates a higher
concentration toward certain items, while higher coverage values
indicate that more items from the catalog are recommended.
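A simplified sketch of how such aggregate-diversity measures can be computed from all users' recommendation lists is shown below. Note that it uses the classical Gini inequality coefficient (higher = more concentrated), whereas the paper reads Gini in the opposite direction, so its reported value likely corresponds to a complement of this quantity; that mapping is an assumption on our part.

```python
# Sketch of aggregate-diversity measures over all users' Top-N lists.
# ItemCV counts distinct recommended items; gini() below is the classical
# inequality coefficient over item exposure counts (higher = more concentrated),
# which may differ in direction from the value reported in the paper.
from collections import Counter

def item_coverage(recommendation_lists):
    return len({item for rec_list in recommendation_lists for item in rec_list})

def gini(recommendation_lists, catalogue_size):
    counts = Counter(item for rec_list in recommendation_lists for item in rec_list)
    exposures = sorted(list(counts.values()) + [0] * (catalogue_size - len(counts)))
    n, total = len(exposures), sum(exposures)
    # Classical Gini coefficient over the sorted exposure distribution.
    return sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(exposures)) / (n * total)

lists = [["a", "b"], ["a", "c"], ["a", "b"]]
print(item_coverage(lists), round(gini(lists, catalogue_size=5), 3))
```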
Facebook Books (Table [tab:combined_all]). ChatGPT-4 achieves a Gini of 0.1050 and an ItemCV of 1,004, outperforming ChatGPT-3.5 (Gini = 0.0713, ItemCV = 853) on both metrics. Although several baselines, such as ItemKNN (Gini = 0.5293, ItemCV = 2,141), still yield a better diversity score, both ChatGPT-4 and ChatGPT-3.5 generally rank above baselines such as MostPop and EASE$`^R`$. In terms of item coverage and Gini, ChatGPT models demonstrate a high concentration on specific items while covering nearly half of the total span (1,004 out of 2,234 items).
Last.FM (Table [tab:combined_all]). A similar trend emerges: ChatGPT-4 has a higher Gini (0.2023) than ChatGPT-3.5 (0.1927), indicating a lower concentration of recommendations on specific items. Additionally, GPT-4 covers 944 out of 1,507 items, whereas $`\textrm{RP}^3_{\beta}`$, which is designed to trade off diversity and accuracy, achieves a coverage value of 831 and a Gini of 0.1441, demonstrating ChatGPT’s strong ability to recommend diverse items.
MovieLens (Table [tab:combined_all]). ChatGPT-4
achieves a Gini of 0.0853, a slight improvement over ChatGPT-3.5
(0.0851). However, its item coverage spans 553 out of 1,862 items, which
is comparatively lower than approaches such as
$`\textrm{RP}^3_{\beta}`$(Gini = 0.1230, ItemCV = 744). These results
highlight that, although the diversity score is lower than certain
baselines, ChatGPT still presents a comparable diversity score on this
dataset.
Summary (RQ1). ChatGPT’s recommendations are moderately diverse
for Facebook Books and Last.FM, while exhibiting limited diversity on
MovieLens, with GPT‑4 consistently outperforming GPT‑3.5. Although it
does not match the highest-diversity baselines, it shows superior
diversity compared to some CF and CBF approaches.
(RQ2.) Are ChatGPT’s recommendations novel? Novelty is measured via
EPC (Expected Popularity Complement) and EFD (Expected Free Discovery),
both interpreted such that higher values imply more novel
recommendations.
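A simplified sketch of these novelty metrics, in the spirit of Vargas and Castells but ignoring rank discounts and relevance weighting, is shown below; the exact formulation used in the paper's evaluation framework may differ.

```python
# Simplified sketch of the novelty metrics, ignoring rank discounts and
# relevance weighting for clarity -- an assumption, not the exact formulation
# used in the evaluation framework. EPC rewards items few users have seen;
# EFD uses each item's self-information (-log2 of its interaction share).
import math

def epc(rec_list, item_popularity, num_users):
    # item_popularity[i]: number of users who interacted with item i
    return sum(1 - item_popularity.get(i, 0) / num_users for i in rec_list) / len(rec_list)

def efd(rec_list, item_popularity, total_interactions):
    novelty = 0.0
    for i in rec_list:
        p = item_popularity.get(i, 0) / total_interactions
        novelty += -math.log2(p) if p > 0 else 0.0
    return novelty / len(rec_list)

pop = {"a": 90, "b": 10, "c": 1}
print(round(epc(["b", "c"], pop, num_users=100), 3),
      round(efd(["b", "c"], pop, total_interactions=101), 3))
```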
Facebook Books (Table [tab:combined_all]). ChatGPT‑4 exhibits relatively high novelty (EPC=0.0353, EFD=0.3486), exceeding most baselines, including ChatGPT‑3.5 (EPC=0.0250, EFD=0.2480), and even surpassing all CF and CBF algorithms on these metrics.
Last.FM (Table [tab:combined_all]). Both ChatGPT versions rank above average in EPC and EFD, with CF and CBF methods (e.g., $`\textrm{RP}^3_{\beta}`$: EPC=0.2110, EFD=1.9970, VSM: EPC=0.1593, EFD=1.4845) performing at a comparable level. Still, the difference between ChatGPT‑4 (0.1918 EPC, 1.8663 EFD) and ChatGPT‑3.5 (0.1680 EPC, 1.6436 EFD) suggests GPT‑4 more effectively recommends less mainstream items.
MovieLens (Table [tab:combined_all]). On
MovieLens, ChatGPT-4 (0.1453 EPC, 1.6010 EFD) outperforms ChatGPT-3.5
(0.1260 EPC, 1.3981 EFD) in terms of EPC and EFD values and places it on
par with other methods (e.g., NeuMF: 0.1171 EPC, 1.2767 EFD), although
lower than $`\textrm{RP}^3_{\beta}`$(EPC of 0.2421, EFD of 2.6613), the
best model.
Summary (RQ2). ChatGPT’s recommendations exhibit above-average
novelty in MovieLens and high novelty in Facebook Books and Last.FM,
with GPT-4 generally surpassing GPT-3.5. The results suggest that
ChatGPT, based on the user’s history, also recommends novel items for
each user.
(RQ3.) Is ChatGPT affected by popularity bias? We examine popularity
bias using APLT (Average Percentage of Long-Tail items; higher values indicate
a stronger inclination toward long-tail, i.e., less popular, items) and ARP
(Average Recommendation Popularity; lower values imply less popularity
bias).
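A minimal sketch of these popularity-bias metrics is shown below, assuming item popularity is the raw interaction count and that the long tail is everything outside the most-interacted head of the catalog; the precise head/tail cutoff used in the paper is not stated here and is an assumption.

```python
# Sketch of the popularity-bias metrics, assuming item popularity is the raw
# interaction count and (as commonly done) the long tail is everything outside
# the most-popular head of the catalog -- the exact cutoff is an assumption.
def arp(rec_list, item_popularity):
    # Average popularity of the recommended items (lower = less biased).
    return sum(item_popularity.get(i, 0) for i in rec_list) / len(rec_list)

def aplt(rec_list, long_tail_items):
    # Fraction of recommended items that come from the long tail (higher = better).
    return sum(1 for i in rec_list if i in long_tail_items) / len(rec_list)

pop = {"a": 500, "b": 40, "c": 3}
long_tail = {"b", "c"}  # assumed: items outside the popular head
print(arp(["a", "b", "c"], pop), round(aplt(["a", "b", "c"], long_tail), 2))
```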
Facebook Books (Table [tab:combined_all]). ChatGPT-3.5’s recommendations yield APLT = 0.1870 and ARP = 46, while ChatGPT-4 improves to APLT = 0.2424 and ARP = 40. With a higher APLT and lower ARP, GPT-4 demonstrates a better capability for recommending long-tail and less popular items than GPT-3.5. Although both models remain far from pure MostPop methods (ARP = 138), some baselines, such as AttributeItemKNN (APLT = 0.5879, ARP = 7) and VSM (APLT = 0.5761, ARP = 7), achieve better APLT and ARP values.
Last.FM (Table [tab:combined_all]). ChatGPT-3.5 has an APLT of 0.1391 and an ARP of 99, while GPT-4 has an APLT of 0.1267 and an ARP of 102, positioning it in the mid-range of models. This suggests that GPT-4 covers a smaller percentage of the long tail and tends to recommend more popular items. Although it outperforms certain baselines, such as $`\textrm{RP}^3_{\beta}`$(APLT = 0.0678, ARP = 153), it does not perform as well as other baselines, such as AttributeItemKNN (APLT = 0.3043, ARP = 87.8647).
MovieLens (Table [tab:combined_all]). ChatGPT
shows an ARP of 90 for GPT-3.5 and 95 for GPT-4, which is lower than
MostPop (ARP = 182) but higher than some graph-based methods (e.g.,
LightGCN: ARP = 43) or neighbor-based methods (e.g., AttributeItemKNN:
ARP = 23). This indicates that its behavior is not as popularity-driven
as MostPop but is still influenced by popular items. A similar trend is
observed for APLT, further demonstrating that ChatGPT does not recommend
items from the long tail and exhibits a degree of popularity bias.
Summary (RQ3). Although ChatGPT’s values are far from those
obtained by MostPop, it still exhibits a tendency to recommend popular
items, neglecting items in the long tail. In particular, GPT-4
demonstrates a lower ARP than ChatGPT-3.5, suggesting a tendency to
recommend less popular items.
To conclude, ChatGPT models exhibit strong beyond-accuracy
performance, achieving an optimal balance of novelty and diversity in
the books domain, comparable results in the music domain, and suboptimal
outcomes in the movie domain. Although it shows some inclination toward
popular items, this bias is far less pronounced compared to MostPop or
other strongly popularity-biased baselines. Furthermore, the
improvements observed from GPT-3.5 to GPT-4 across all three datasets
highlight the strength of GPT-4 for recommendations, particularly in
balancing beyond-accuracy trade-offs.
These findings underscore the potential of ChatGPT as a recommender system while also highlighting areas for improvement, particularly in refining its ability to balance relevance, diversity, and novelty across domains.
User Cold-Start Scenario
We now examine user cold‐start recommendations, defined here as
scenarios where each user has provided a maximum of ten interactions.
Table [tab:coldstart_10] details these
results across three datasets, Facebook Books, Last.FM, and MovieLens,
comparing ChatGPT‐3.5 and ChatGPT‐4 to strong Collaborative Filtering
(CF) and Content‐Based Filtering (CBF) baselines. Our central question
is:
RQ4: How effective is ChatGPT in the user cold-start scenario across
accuracy and beyond-accuracy dimensions?
Accuracy under Cold‐Start. Despite limited user interactions, ChatGPT exhibits competitive to superior accuracy compared to traditional baselines. For Facebook Books, GPT-4 achieves higher nDCG (0.0538) and Recall (0.0873) than all baselines, including $`\textrm{RP}^3_{\beta}`$ (nDCG = 0.0346) and AttributeItemKNN (0.0335). GPT-3.5 also surpasses these baselines but is slightly behind GPT-4. For Last.FM, ChatGPT maintains robust performance ($`nDCG\geq0.2791`$, $`Recall\geq0.3423`$), outperforming MostPop (nDCG = 0.0529) and random baselines by a wide margin. Although $`\textrm{RP}^3_{\beta}`$ leads in nDCG (0.2389), GPT-4 often excels in Recall and Precision. For MovieLens, GPT-4 attains the highest nDCG (0.1405), surpassing both CF and CBF baselines, while ChatGPT-3.5 (0.1117) also remains competitive. These results underscore ChatGPT’s capacity to identify relevant items effectively from few interactions.
Beyond‐Accuracy in Cold‐Start.
Diversity. GPT‑4 generally surpasses GPT‑3.5 in Gini and item coverage across all three datasets (e.g., increasing from 0.0538 to 0.0846 in Gini on Facebook Books), indicating that GPT‑4’s recommendations span a broader set of items. Although baselines like $`\textrm{RP}^3_{\beta}`$achieve higher coverage in MovieLens and Facebook Books, GPT-4 performs best on Last.FM.
Novelty. ChatGPT’s EPC and EFD values exceed those of CF and CBF baselines across all datasets (e.g., GPT‑4’s EPC = 0.0186 vs. $`\textrm{RP}^3_{\beta}`$= 0.0115 on Facebook Books), implying a tendency to recommend novel items rather than relying on mainstream items.
Popularity Bias. ChatGPT exhibits a moderate inclination toward
popular items compared to baselines across all datasets. Nonetheless, it
remains far from MostPop (e.g., $`ARP\geq139`$ on Facebook Books) but is
comparable to some baselines (e.g., AttributeUserKNN), indicating room
for further mitigation strategies.
In summary, ChatGPT proves highly effective in cold-start
scenarios by: (i) maintaining strong accuracy despite minimal user
interactions, with GPT-4 often outperforming GPT-3.5; (ii) striking a
balance among diversity, novelty, and popularity bias; (iii)
demonstrating consistent improvements over baselines, underscoring
ChatGPT’s capacity to infer user interests with limited interactions.
Conclusion
In this work, we explore the diversity, novelty, and popularity bias of ChatGPT recommendations. Our findings demonstrate that for the Facebook Books, Last.FM, and MovieLens datasets, ChatGPT models exhibit strong beyond-accuracy performance, achieving an optimal balance of novelty and diversity in Facebook Books, comparable results for Last.FM, and suboptimal outcomes for MovieLens.
Additionally, we show that while ChatGPT demonstrates a good balance between novelty and diversity, it also exhibits a tendency to recommend popular items, especially in the MovieLens dataset.
Finally, we extend our exploration to the user cold-start scenario, where ChatGPT proves highly effective by maintaining strong accuracy despite minimal user interactions, balancing diversity, novelty, and popularity bias, and demonstrating consistent improvements over baselines.
These findings underscore the beyond-accuracy capabilities of ChatGPT as a recommender system. Future research will include additional datasets to generalize the findings across domains, as well as experiments comparing ChatGPT with other LLMs such as Gemini, LLaMA, and DeepSeek.
Limitation
Nowadays, LLMs are used to augment the capabilities of recommender systems. However, these models are typically trained on vast internet-scale corpora, which may include portions of open datasets used for benchmarking. Recent work studying memorization in MovieLens-1M shows that models like GPTs and LLaMA-3 can memorize such datasets, with larger models exhibiting higher memorization rates. For example, the reported memorization rate is 12.9% for LLaMA-3.1 405B and 80.76% for GPT-4. Further research should focus on understanding the correlation between improvements in recommendation quality and memorization capacity.
References

F. Ricci, L. Rokach, B. Shapira (Eds.), Recommender Systems
Handbook, Springer, US, 2022. S. Zhang, L. Yao, A. Sun, Y. Tay, Deep
learning based recommender system: A survey and new perspectives,
Comput. Surv. 52 (2019) 5:1–5:38. J. Chung, Ç. Gülçehre, K. Cho,
Y. Bengio, Empirical evaluation of gated recurrent neural networks on
sequence modeling, CoRR abs/1412.3555 (2014). URL:
http://arxiv.org/abs/1412.3555. arXiv:1412.3555. A. Gu, T. Dao,
Mamba: Linear-time sequence modeling with selective state spaces, CoRR
abs/2312.00752 (2023). URL: https://doi.org/10.48550/arXiv.2312.00752.
doi:10.48550/ARXIV.2312.00752. arXiv:2312.00752. J. Devlin, M. Chang,
K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers
for language understanding, in: NAACL-HLT (1), 2019, pp. 4171–4186.
A. D. Bellis, V. W. Anelli, T. D. Noia, E. D. Sciascio, prompt-based
detection of semantic containment patterns in mlms, in: G. Demartini,
K. Hose, M. Acosta, M. Palmonari, G. Cheng, H. Skaf-Molli, N. Ferranti,
D. Hernández, A. Hogan (Eds.), The Semantic Web - ISWC 2024 - 23rd
International Semantic Web Conference, Baltimore, MD, USA, November
11-15, 2024, Proceedings, Part II, volume 15232 of Lecture Notes in
Computer Science, Springer, 2024, pp. 227–246. URL:
https://doi.org/10.1007/978-3-031-77850-6_13.
doi:10.1007/978-3-031-77850-6_13. G. Servedio, A. De Bellis,
D. Di Palma, V. W. Anelli, T. Di Noia, Are the hidden states hiding
something? testing the limits of factuality-encoding capabilities in
llms, arXiv preprint arXiv:2505.16520 (2025). D. Di Palma, A. De Bellis,
G. Servedio, V. W. Anelli, F. Narducci, T. Di Noia, Llamas have feelings
too: Unveiling sentiment and emotion representations in llama models
through probing, arXiv preprint arXiv:2505.16491 (2025). P. Aghilar,
V. W. Anelli, M. Trizio, E. Di Sciascio, T. Di Noia, Training-free,
identity-preserving image editing for fashion pose alignment and
normalization, Expert Systems with Applications 293 (2025a) 128579.
doi:https://doi.org/10.1016/j.eswa.2025.128579. P. Aghilar, V. W.
Anelli, A. Lops, F. Narducci, A. Ragone, S. Roccotelli, M. Trizio,
Adaptive user modeling in visual merchandising: Balancing brand identity
with operational efficiency, in: Proceedings of the 33rd ACM Conference
on User Modeling, Adaptation and Personalization, UMAP 2025, New York
City, NY, USA, June 16-19, 2025, ACM, 2025b, pp. 358–360. URL:
https://doi.org/10.1145/3699682.3730976. doi:10.1145/3699682.3730976.
Y. Ping, Y. Li, J. Zhu, Beyond accuracy measures: the effect of
diversity, novelty and serendipity in recommender systems on user
engagement, Electronic Commerce Research (2024) 1–28. T. Duricic,
D. Kowald, E. Lacic, E. Lex, Beyond-accuracy: a review on diversity,
serendipity, and fairness in recommender systems based on graph neural
networks, Frontiers Big Data 6 (2024). S. Karimi, H. A. Rahmani,
M. Naghiaei, L. Safari, Provider fairness and beyond-accuracy trade-offs
in recommender systems, CoRR abs/2309.04250 (2023). M. Attimonelli,
A. D. Bellis, C. Pomo, D. Jannach, E. D. Sciascio, T. D. Noia, Do we
really need specialization? evaluating generalist text embeddings for
zero-shot recommendation and search, in: RecSys, ACM, 2025. URL:
https://doi.org/10.1145/3705328.3748040. doi:10.1145/3705328.3748040.
D. Di Palma, G. Servedio, V. W. Anelli, G. M. Biancofiore, F. Narducci,
L. Carnimeo, T. D. Noia, Beyond words: Can chatgpt support
state-of-the-art recommender systems?, in: IIR, volume 3802 of CEUR
Workshop Proceedings, CEUR-WS.org, 2024, pp. 13–22. M. Valentini,
Cooperative and competitive llm-based multi-agent systems for
recommendation, in: C. Hauff, C. Macdonald, D. Jannach, G. Kazai, F. M.
Nardini, F. Pinelli, F. Silvestri, N. Tonellotto (Eds.), Advances in
Information Retrieval - 47th European Conference on Information
Retrieval, ECIR 2025, Lucca, Italy, April 6-10, 2025, Proceedings, Part
V, volume 15576 of Lecture Notes in Computer Science, Springer, 2025,
pp. 204–211. Y. Hou, J. Zhang, Z. Lin, H. Lu, R. Xie, J. J. McAuley,
W. X. Zhao, Large language models are zero-shot rankers for recommender
systems, in: ECIR (2), volume 14609 of Lecture Notes in Computer
Science, Springer, 2024, pp. 364–381. D. Di Palma, Retrieval-augmented
recommender system: Enhancing recommender systems with large language
models, in: RecSys, ACM, 2023, pp. 1369–1373. S. Dai, N. Shao, H. Zhao,
W. Yu, Z. Si, C. Xu, Z. Sun, X. Zhang, J. Xu, Uncovering chatgpt’s
capabilities in recommender systems, in: RecSys, ACM, 2023, pp.
1126–1132. M. Attimonelli, D. Danese, D. Malitesta, C. Pomo, G. Gassi,
T. D. Noia, Ducho 2.0: Towards a more up-to-date unified framework for
the extraction of multimodal features in recommendation, in: WWW
(Companion Volume), ACM, 2024, pp. 1075–1078. J. Liu, C. Liu, R. Lv,
K. Zhou, Y. Zhang, Is chatgpt a good recommender? A preliminary study,
CoRR abs/2304.10149 (2023). D. Carraro, D. Bridge, Enhancing
recommendation diversity by re-ranking with large language models, ACM
Trans. Recomm. Syst. (2024). URL: https://doi.org/10.1145/3700604.
doi:10.1145/3700604. Y. Tokutake, K. Okamoto, Can large language models
assess serendipity in recommender systems?, J. Adv. Comput. Intell.
Intell. Informatics 28 (2024) 1263–1272. D. Di Palma, G. M. Biancofiore,
V. W. Anelli, F. Narducci, T. D. Noia, Content-based or collaborative?
insights from inter-list similarity analysis of chatgpt recommendations,
in: UMAP (Adjunct Publication), ACM, 2025, pp. 28–33. Y. Deldjoo,
Understanding biases in chatgpt-based recommender systems: Provider
fairness, temporal stability, and recency, ACM Trans. Recomm. Syst.
(2024). URL: https://doi.org/10.1145/3690655. doi:10.1145/3690655.
J. Zhang, K. Bao, Y. Zhang, W. Wang, F. Feng, X. He, Is chatgpt fair for
recommendation? evaluating fairness in large language model
recommendation, in: RecSys, 2023, pp. 993–999. A. C. M. Mancino,
A. Ferrara, S. Bufi, D. Malitesta, T. D. Noia, E. D. Sciascio, Kgtore:
Tailored recommendations through knowledge-aware GNN models, in: RecSys,
2023, pp. 576–587. S. Bufi, A. C. M. Mancino, A. Ferrara, D. Malitesta,
T. D. Noia, E. D. Sciascio, KGUF: simple knowledge-aware graph-based
recommender with user-based semantic features filtering, in: IRonGraphs,
volume 2197 of Communications in Computer and Information Science,
Springer, 2024, pp. 41–59. F. M. Harper, J. A. Konstan, The movielens
datasets: History and context, Trans. Interact. Intell. Syst. 5 (2016)
19:1–19:19. I. Cantador, P. Brusilovsky, T. Kuflik, Second workshop on
information heterogeneity and fusion in recommender systems
(hetrec2011), in: RecSys, ACM, New York, NY, USA, 2011, pp. 387–388.
G. M. Biancofiore, D. Di Palma, C. Pomo, F. Narducci, T. Di Noia,
Conversational user interfaces and agents, in: Human-Centered AI: An
Illustrated Scientific Quest, Springer, 2025, pp. 399–438. J. L.
Herlocker, J. A. Konstan, L. G. Terveen, J. Riedl, Evaluating
collaborative filtering recommender systems, Trans. Inf. Syst. 22 (2004)
5–53. T. Silveira, M. Zhang, X. Lin, Y. Liu, S. Ma, How good your
recommender system is? A survey on evaluations in recommendation, Int.
J. Mach. Learn. Cybern. 10 (2019) 813–831. M. Karimi, D. Jannach,
M. Jugovac, News recommender systems - survey and roads ahead, Inf.
Process. Manag. 54 (2018) 1203–1227. A. Gunawardana, G. Shani, S. Yogev,
Evaluating recommender systems, in: Recommender Systems Handbook,
Springer, US, 2022, pp. 547–601. M. Kaminskas, D. Bridge, Diversity,
serendipity, novelty, and coverage: A survey and empirical analysis of
beyond-accuracy objectives in recommender systems, Trans. Interact.
Intell. Syst. 7 (2017) 2:1–2:42. P. Cheng, S. Wang, J. Ma, J. Sun,
H. Xiong, Learning to recommend accurate and diverse items, in: WWW,
ACM, 2017, pp. 183–192. W. Wu, L. Chen, Y. Zhao, Personalizing
recommendation diversity based on user personality, User Model. User
Adapt. Interact. 28 (2018) 237–276. M. Nakatsuji, Y. Fujiwara,
A. Tanaka, T. Uchiyama, K. Fujimura, T. Ishida, Classical music for rock
fans?: novel recommendations for expanding user interests, in: CIKM,
ACM, 2010, pp. 949–958. M. Cai, L. Chen, Y. Wang, H. Bai, P. Sun, L. Wu,
M. Zhang, M. Wang, Popularity-aware alignment and contrast for
mitigating popularity bias, in: KDD, ACM, 2024, pp. 187–198.
V. Paparella, D. Di Palma, V. W. Anelli, T. D. Noia, Broadening the
scope: Evaluating the potential of recommender systems beyond
prioritizing accuracy, in: RecSys, ACM, 2023, pp. 1139–1145. D. Jannach,
L. Lerche, I. Kamehkhosh, M. Jugovac, What recommenders recommend: an
analysis of recommendation biases and possible countermeasures, User
Model. User Adapt. Interact. 25 (2015) 427–491. G. Adomavicius,
J. Zhang, Impact of data characteristics on recommender systems
performance, Trans. Manag. Inf. Syst. 3 (2012) 3:1–3:17. S. Vargas,
P. Castells, Rank and relevance in novelty and diversity metrics for
recommender systems, in: RecSys, ACM, 2011, pp. 109–116.
H. Abdollahpouri, R. Burke, B. Mobasher, Managing popularity bias in
recommender systems with personalized re-ranking, in: FLAIRS, AAAI
Press, 2019, pp. 413–418. Y. Gao, T. Sheng, Y. Xiang, Y. Xiong, H. Wang,
J. Zhang, Chat-rec: Towards interactive and explainable llms-augmented
recommender system, CoRR abs/2303.14524 (2023). A. Manzoor, S. C.
Ziegler, K. M. P. Garcia, D. Jannach, Chatgpt as a conversational
recommender system: A user-centric analysis, in: UMAP, ACM, 2024, pp.
267–272. S. Sanner, K. Balog, F. Radlinski, B. Wedin, L. Dixon, Large
language models are competitive near cold-start recommenders for
language- and item-based preferences, in: RecSys, 2023, pp. 890–896.
Z. Li, Y. Chen, X. Zhang, X. Liang, Bookgpt: A general framework for
book recommendation empowered by large language model, Electronics 12
(2023) 4654. T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan,
P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal,
A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M.
Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin,
S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford,
I. Sutskever, D. Amodei, Language models are few-shot learners, in:
NeurIPS, 2020. A. Kong, S. Zhao, H. Chen, Q. Li, Y. Qin, R. Sun,
X. Zhou, E. Wang, X. Dong, Better zero-shot reasoning with role-play
prompting, in: NAACL-HLT, Association for Computational Linguistics,
2024, pp. 4099–4113. T. Kojima, S. S. Gu, M. Reid, Y. Matsuo,
Y. Iwasawa, Large language models are zero-shot reasoners, CoRR
abs/2205.11916 (2022). J. Wei, X. Wang, D. Schuurmans, M. Bosma,
B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, Chain-of-thought
prompting elicits reasoning in large language models, in: NeurIPS, 2022.
S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, K. Narasimhan,
Tree of thoughts: Deliberate problem solving with large language models,
in: NeurIPS, 2023. N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan,
S. Yao, Reflexion: language agents with verbal reinforcement learning,
in: NeurIPS, 2023. Z. Liu, X. Yu, Y. Fang, X. Zhang, Graphprompt:
Unifying pre-training and downstream tasks for graph neural networks,
in: WWW, ACM, 2023, pp. 417–428. P. Sahoo, A. K. Singh, S. Saha,
V. Jain, S. Mondal, A. Chadha, A systematic survey of prompt engineering
in large language models: Techniques and applications, CoRR
abs/2402.07927 (2024). L. Xu, J. Zhang, B. Li, J. Wang, M. Cai, W. X.
Zhao, J. Wen, Prompting large language models for recommender systems: A
comprehensive framework and empirical analysis, CoRR abs/2401.04997
(2024). A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever,
et al., Language models are unsupervised multitask learners, OpenAI blog
1 (2019) 9. J. Jin, X. Chen, F. Ye, M. Yang, Y. Feng, W. Zhang, Y. Yu,
J. Wang, Lending interaction wings to recommender systems with
conversational agents, in: NeurIPS, 2023. A. Kong, S. Zhao, H. Chen,
Q. Li, Y. Qin, R. Sun, X. Zhou, Better zero-shot reasoning with
role-play prompting, CoRR abs/2308.07702 (2023). A. C. M. Mancino,
S. Bufi, A. di Fazio, A. Ferrara, D. Malitesta, C. Pomo, T. D. Noia,
Datarec: A python library for standardized and reproducible data
management in recommender systems, in: Proceedings of the 48th
International ACM SIGIR Conference on Research and Development in
Information Retrieval, SIGIR 2025, Padua, Italy July 13-18, 2025, ACM,
2025. URL: https://doi.org/10.1145/3726302.3730320.
doi:10.1145/3726302.3730320. V. Paparella, V. W. Anelli, F. M. Nardini,
R. Perego, T. D. Noia, Post-hoc selection of pareto-optimal solutions in
search and recommendation, in: CIKM, ACM, 2023, pp. 2013–2023. V. W.
Anelli, A. Bellogı́n, A. Ferrara, D. Malitesta, F. A. Merra, C. Pomo,
F. M. Donini, T. D. Noia, Elliot: A comprehensive and rigorous framework
for reproducible recommender systems evaluation, in: SIGIR, ACM, New
York, NY, USA, 2021, pp. 2405–2414. A. Ferrara, V. W. Anelli, A. C. M.
Mancino, T. D. Noia, E. D. Sciascio, Kgflex: Efficient recommendation
with sparse feature factorization and knowledge graphs, ACM Trans.
Recomm. Syst. (2023). U. Javed, K. Shaukat, I. A. Hameed, F. Iqbal,
T. M. Alam, S. Luo, A review of content-based and context-based
recommendation systems, International Journal of Emerging Technologies
in Learning (iJET) 16 (2021) 274–306. B. Paudel, F. Christoffel,
C. Newell, A. Bernstein, Updatable, accurate, diverse, and scalable
recommendations for interactive applications, Trans. Interact. Intell.
Syst. 7 (2017) 1:1–1:34. X. He, K. Deng, X. Wang, Y. Li, Y. Zhang,
M. Wang, Lightgcn: Simplifying and powering graph convolution network
for recommendation, in: SIGIR, 2020, pp. 639–648. C. Cooper, S. Lee,
T. Radzik, Y. Siantos, Random walks in recommender systems: exact
computation and simulations, in: WWW (Companion Volume), 2014, pp.
811–816. B. M. Sarwar, G. Karypis, J. A. Konstan, J. Riedl, Analysis of
recommendation algorithms for e-commerce, in: EC, ACM, New York, NY,
USA, 2000, pp. 158–167. J. S. Breese, D. Heckerman, C. M. Kadie,
Empirical analysis of predictive algorithms for collaborative filtering,
in: UAI, 1998, pp. 43–52. H. Steck, Embarrassingly shallow autoencoders
for sparse data, in: WWW, ACM, New York, NY, USA, 2019, pp. 3251–3257.
S. Rendle, W. Krichene, L. Zhang, J. R. Anderson, Neural collaborative
filtering vs. matrix factorization revisited, in: RecSys, 2020, pp.
240–248. X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T. Chua, Neural
collaborative filtering, in: WWW, 2017, pp. 173–182. G. Salton, A. Wong,
C. Yang, A vector space model for automatic indexing, Commun. ACM 18
(1975) 613–620. Z. Gantner, S. Rendle, C. Freudenthaler,
L. Schmidt-Thieme, Mymedialite: a free recommender system library, in:
RecSys, ACM, New York, NY, USA, 2011, pp. 305–308. Y. Chen, Q. Fu,
Y. Yuan, Z. Wen, G. Fan, D. Liu, D. Zhang, Z. Li, Y. Xiao, Hallucination
detection: Robustly discerning reliable answers in large language
models, in: CIKM, 2023, pp. 245–255. F. Nie, J. Yao, J. Wang, R. Pan,
C. Lin, A simple recipe towards reducing hallucination in neural surface
realisation, in: ACL (1), 2019, pp. 2673–2679. Z. Ji, N. Lee,
R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, P. Fung,
Survey of hallucination in natural language generation, Comput. Surv. 55
(2023) 248:1–248:38. V. E. Giuliano, P. E. J. Jr., G. E. Kimball, R. F.
Meyer, B. A. Stein, Automatic pattern recognition by a gestalt method,
Inf. Control. 4 (1961) 332–345. A. V. Petrov, C. MacDonald, gsasrec:
Reducing overconfidence in sequential recommendation trained with
negative sampling, in: RecSys, 2023, pp. 116–128. D. L. Olson, D. Delen,
Advanced Data Mining Techniques, Springer, US, 2008. K. Järvelin,
J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, Trans.
Inf. Syst. 20 (2002) 422–446. D. Di Palma, F. A. Merra, M. Sfilio, V. W.
Anelli, F. Narducci, T. Di Noia, Do llms memorize recommendation
datasets? a preliminary study on movielens-1m, in: Proceedings of the
48th International ACM SIGIR Conference on Research and Development in
Information Retrieval, SIGIR 2025, Padua, Italy July 13-18, 2025, ACM,
2025. URL: https://doi.org/10.1145/3726302.3730178.
doi:10.1145/3726302.3730178.
A Note of Gratitude
The copyright of this content belongs to the respective researchers. We deeply appreciate their hard work and contribution to the advancement of human civilization.
1. While we acknowledge that diversity can be measured in multiple ways, such as Top-N diversity (user-level variation in recommendation lists) or temporal diversity (diversity over time), we focus on aggregate diversity due to its measurable implications for item exposure and long-tail promotion. ↩︎
2. https://2015.eswc-conferences.org/program/semwebeval.html ↩︎