Exploring Diversity, Novelty, and Popularity Bias in ChatGPT's Recommendations
📝 Original Paper Info
- Title: Exploring Diversity, Novelty, and Popularity Bias in ChatGPT's Recommendations
- ArXiv ID: 2601.01997
- Date: 2026-01-05
- Authors: Dario Di Palma, Giovanni Maria Biancofiore, Vito Walter Anelli, Fedelucio Narducci, Tommaso Di Noia
📝 Abstract
ChatGPT has emerged as a versatile tool, demonstrating capabilities across diverse domains. Given these successes, the Recommender Systems (RSs) community has begun investigating its applications within recommendation scenarios primarily focusing on accuracy. While the integration of ChatGPT into RSs has garnered significant attention, a comprehensive analysis of its performance across various dimensions remains largely unexplored. Specifically, the capabilities of providing diverse and novel recommendations or exploring potential biases such as popularity bias have not been thoroughly examined. As the use of these models continues to expand, understanding these aspects is crucial for enhancing user satisfaction and achieving long-term personalization. This study investigates the recommendations provided by ChatGPT-3.5 and ChatGPT-4 by assessing ChatGPT's capabilities in terms of diversity, novelty, and popularity bias. We evaluate these models on three distinct datasets and assess their performance in Top-N recommendation and cold-start scenarios. The findings reveal that ChatGPT-4 matches or surpasses traditional recommenders, demonstrating the ability to balance novelty and diversity in recommendations. Furthermore, in the cold-start scenario, ChatGPT models exhibit superior performance in both accuracy and novelty, suggesting they can be particularly beneficial for new users. This research highlights the strengths and limitations of ChatGPT's recommendations, offering new perspectives on the capacity of these models to provide recommendations beyond accuracy-focused metrics.
💡 Summary & Analysis
1. **Principle Explanation**:
- **Basic**: Recommender systems predict user preferences and deliver personalized content to help users find valuable information. ChatGPT can aid in generating better recommendations.
- **Intermediate**: While traditional recommender systems focus on accuracy, ChatGPT's recommendations can also be examined along dimensions such as diversity, novelty, and popularity bias, offering users a wider range of experiences.
- **Advanced**: As a Large Language Model (LLM), ChatGPT is proficient at text generation. Integrating it with recommender systems enables broader item coverage and more nuanced explanations.
2. **Research Methodology**:
- Basic: The research team evaluated the diversity, novelty, and popularity bias of recommendations using ChatGPT across various datasets.
- Intermediate: They developed a methodology utilizing role-playing prompts which reduces duplicate recommendations and provides better results.
- Advanced: Prompt designs were inspired by GPT-3 success cases and various prompt techniques were used to optimize model performance.
3. **Significance of Research**:
- Basic: This research helps understand how ChatGPT can be integrated into recommendation systems.
- Intermediate: Evaluations using diverse datasets demonstrate that ChatGPT could provide better user experiences.
- Advanced: The study analyzes how ChatGPT operates beyond accuracy in terms of diversity, novelty, and popularity bias across different dimensions.
📄 Full Paper Content (ArXiv Source)
Keywords: ChatGPT, Recommender Systems (RSs), Large Language Models (LLMs), Diversity, Novelty, Popularity Bias, Cold-Start
Introduction
Recommender systems (RSs) have long assisted users in discovering valuable information on the web by predicting their preferences and delivering personalized content. Over time, these systems have evolved from Matrix Factorization approaches to modern architectures that extend state-of-the-art Deep Learning models, originally developed for other domains such as time-series forecasting, natural language processing, and computer vision. Despite significant progress in improving accuracy, current research in the user modeling and personalization community increasingly emphasizes the importance of beyond-accuracy perspectives such as diversity, novelty, and popularity bias. These factors not only impact overall system effectiveness but also influence user satisfaction, long-term engagement, and fairness.
With the release of ChatGPT in November 2022, Large Language Models (LLMs) have begun to reshape how recommendations can be delivered. Unlike traditional RSs that rely on carefully structured training data, LLMs can generate free-form text, potentially offering more nuanced explanations and broader item coverage by leveraging their vast knowledge. Consequently, the research community is now experimenting with LLM-driven recommendation pipelines, demonstrating notable successes in improving recommendation accuracy. However, most existing studies on ChatGPT-based recommender systems have emphasized improving accuracy while neglecting the beyond-accuracy dimensions that are critical for real-world impact.
Ignoring the beyond-accuracy behavior of ChatGPT creates a black box for researchers, making it difficult to determine whether it over-recommends popular items, reduces novelty, or offers less diverse recommendations, all of which may negatively impact user satisfaction and long-term personalization goals. Early investigations focused on how to use ChatGPT for re-ranking recommendations, while others began to study the serendipity of the generated recommendations or explored how ChatGPT generates recommendations and whether its outputs align more closely with content-based or collaborative filtering approaches. Only one study investigates biases in ChatGPT-based recommender systems, with a specific focus on provider fairness. While a few works have begun examining potential biases related to sensitive attributes such as race, gender, and religion, aspects like recommendation diversity, novelty, and popularity bias in ChatGPT remain largely unexplored. Addressing these gaps is essential to ensure that personalization technologies are both effective and fair.
To this end, we analyze ChatGPT’s recommendation behavior, focusing on both ChatGPT-3.5 and ChatGPT-4 across multiple beyond-accuracy metrics. Specifically, we investigate whether ChatGPT generates diverse and novel recommendations or exhibits popularity bias, both under normal conditions and in user cold-start scenarios where users have interacted with only a few items. Our evaluation spans three distinct domains, Books, Movies, and Music, using the Facebook Books, MovieLens, and Last.FM datasets as benchmarks, aiming to answer the following Research Questions (RQs):
- RQ1: Are ChatGPT’s recommendations diverse?
- RQ2: Are ChatGPT’s recommendations novel?
- RQ3: Is ChatGPT affected by popularity bias?
- RQ4: How effective is ChatGPT in the user cold-start scenario across accuracy and beyond-accuracy dimensions?
Related Work
Diversity, Novelty, and Popularity Bias in Recommender Systems. Driven by the need for Recommender Systems (RSs) to enhance user engagement, this work focuses on beyond-accuracy measures of RSs, namely, diversity, novelty, and popularity bias, to investigate how these factors affect the recommendation lists provided by ChatGPT.
There was a moment in the evolution of RSs when researchers realized that evaluating recommendations solely based on accuracy metrics was insufficient. For instance, some authors suggest that the performance of recommendations should be measured by their usefulness to the user. Similarly, a survey on the evaluation of RSs suggests that recommendations can be evaluated based on utility, novelty, diversity, unexpectedness, serendipity, and coverage. Karimi, Jannach, and Jugovac, in their review of state-of-the-art news RSs, identify diversity, novelty, and popularity as the most common quality factors for improving recommendations. Specifically, diversity and novelty are often considered quality factors that must be balanced with prediction accuracy, and they are among the most discussed beyond-accuracy objectives in recommender system research.
As interest in studying RSs beyond accuracy metrics spread, more studies began to use these metrics as goals for improvement. For example, several works focused on creating RSs that not only predict accurate items but also achieve a high level of diversity in recommendations. Others employed a graph-based approach to identify items with higher novelty, or proposed methods to mitigate popularity bias. Further work emphasizes the importance of evaluating RSs beyond accuracy, proposing a multi-objective evaluation approach.
Although many works use beyond-accuracy metrics to evaluate and improve RSs, prior literature lacks a unified framework that rigorously defines diversity, novelty, and bias, leading to vagueness and overlap among these measures. In our study, we define these concepts as follows. Diversity is the extent to which a recommender system suggests a wide range of items from the catalog¹. Novelty is the degree to which recommended items expose users to relevant experiences they are unlikely to discover independently. Popularity Bias refers to the tendency of recommender systems to favor popular items, those with many interactions, over less popular or niche items.
ChatGPT-based recommendation. A first example of ChatGPT for recommendation is Chat-REC, a ChatGPT-augmented recommender system that translates the recommendation task into an interactive conversation with users. The authors proposed a prompt template to convert user information and user-item interactions into a query for ChatGPT. However, the system was evaluated solely using accuracy metrics (i.e., Recall, Precision, nDCG). Another study investigates ChatGPT’s performance in a multi-turn conversational recommendation setting, demonstrating its potential as a conversational recommender and showing that it outperforms traditional methods. Other authors focused on ChatGPT in zero-shot settings, analyzing ChatGPT models with a dedicated prompting template and revealing that ChatGPT-4 achieved the highest ranking performance compared to other LLMs in the zero-shot recommendation task.
A further study investigated the abilities of ChatGPT as a recommender system for the Top-N recommendation task, aiming to identify the most effective prompting strategy for producing relevant recommendations. The authors concluded that the zero-shot setting yields the most relevant recommendation list, outperforming content-based baselines. However, their conclusions were based solely on nDCG as the evaluation metric, which limits the findings to only one dimension of RSs.
Other authors investigate ChatGPT’s abilities in suggesting items through rating prediction, pairwise recommendation, and re-ranking strategies using prompting. Their experiments, conducted on four domains, demonstrate ChatGPT’s ability to recommend items. Nonetheless, this study provides only an accuracy-oriented view of ChatGPT’s capabilities in Top-N recommendation.
Finally, another work focuses on applying ChatGPT within the book recommendation scenario, designing BookGPT to address single-item and rating prediction tasks. However, the study does not provide a generalizable analysis of ChatGPT’s performance across multiple domains, as the authors focus only on the book domain.
Although all the presented works focus on using ChatGPT to improve the performance of recommender systems, they are primarily based on accuracy metrics. To address this gap, our work investigates the task of Top-N recommendation, moving beyond accuracy by evaluating ChatGPT’s performance in terms of diversity, novelty, and popularity bias, while also highlighting its beyond-accuracy capabilities in user cold-start scenarios.
Methodology
The following sections discuss the methodology used in our research, outline the design of the prompts employed to collect recommendations from ChatGPT, detail the datasets used in the experiments, present the baselines for comparison, and list the metrics used to assess diversity, novelty, and popularity bias.
Prompt Design
The introduction of GPT-3 demonstrated the ability of LLMs to perform diverse tasks when provided with clear, task-specific prompts, showing how prompts condition the model’s response and play a critical role in shaping its performance on a given task.
With the widespread diffusion of ChatGPT, the literature on prompt engineering has expanded, moving from basic prompts such as zero- and few-shot to more complex prompts like Chain-of-Thought, Tree-of-Thoughts, Reflexion, or Graph-Prompting. Among the various prompt techniques, we hand-engineered Zero-Shot, Few-Shot, Chain-of-Thought, and Role-Playing (RP) prompting, following prior work on prompting strategies, to identify the best approach for our investigation.
In the following, we present the hand-engineered prompts and explain the main reasons for selecting RP prompting as the primary technique for our investigation. Specifically, for all the tested prompts and for each user, the input consists of the user’s history, presented as a list of items formatted as follows: $`\{History\ of\ the\ User\}: Item_1, Item_2, \ldots, Item_{N}`$.
Zero-shot prompting. In zero-shot prompting, we directly provided the user’s history to ChatGPT and asked for 50 recommendations, as shown in the reference example (see Fig. 1). However, $`\sim`$71% of the generated lists contained fewer than 50 items or included repeated entries, and $`\sim`$6% exhibited incorrect task execution.
Few-shot prompting. After zero-shot prompting, we tested few-shot prompting by providing a few demonstrations of recommendations to help the LLM better understand the task (see Fig. 2). While these contextual examples reduced execution errors, $`\sim`$44% of the generated lists contained duplicate items.
Chain-of-Thought (CoT) prompting. Using CoT, we attempted to break the recommendation task into explicit steps to force ChatGPT to reason step-by-step. As shown in Fig. 3, we explicitly defined the instructions, the user’s preferences, and the steps to identify the most suitable recommendations. This approach produced excessive tokens, reaching the context limit after generating $`\sim`$26 items.
Role-Playing prompting. Following prior work on role-play prompting, we also tested Role-Playing prompts, where ChatGPT impersonates a Recommender System and recommends items based on the user’s history (see Fig. 4). This strategy proved the most effective, eliminating duplicate recommendations.
After testing 30 hand-crafted prompts and aligning with studies on Role-Playing Prompting, we selected this approach for its ability to reduce duplicates and token usage. In this setup, ChatGPT acts as a Recommender System, generating 50 recommendations based on the user’s history (see Fig. 4).
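To make the selected strategy concrete, the snippet below is a minimal, illustrative sketch of how a role-playing prompt for 50 recommendations could be assembled and sent through the OpenAI Python client. The exact wording of the paper's prompt (Fig. 4), the helper names, and the model identifier are assumptions for illustration, not the authors' artifacts.

```python
# Minimal sketch of a role-playing prompt for Top-N recommendation.
# The exact wording of the paper's Fig. 4 prompt is not reproduced here;
# names, phrasing, and the model identifier are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def role_playing_prompt(user_history, n=50):
    items = ", ".join(user_history)
    return (
        "You are a Recommender System. Based on the user's history, "
        f"recommend exactly {n} items the user has not interacted with, "
        "as a numbered list with no duplicates and no explanations.\n"
        f"{{History of the User}}: {items}"
    )

def get_recommendations(user_history, model="gpt-4"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": role_playing_prompt(user_history)}],
    )
    return response.choices[0].message.content

# Example usage with a short movie history
print(get_recommendations(["Toy Story", "Heat", "GoldenEye"]))
```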
Experimental Setup
This section outlines the experimental setup, including the datasets, baselines, and metrics used to assess the beyond-accuracy performance of ChatGPT’s recommendations, with a focus on diversity, novelty, and popularity bias.
Datasets. We evaluated ChatGPT on three well-known recommendation datasets, namely MovieLens100k, Last.FM, and Facebook Books². To enhance data quality, we applied an iterative 10-core filtering strategy, retaining only users and items with at least ten interactions (a minimal sketch of this filtering appears after Table 1). Table 1 reports the dataset statistics after preprocessing.
| Dataset | Interaction | Users | Items | Sparsity | Content |
|---|---|---|---|---|---|
| MovieLens | 42,456 | 603 | 1,862 | 96.22% | genre |
| Last.FM | 49,171 | 1,797 | 1,507 | 98.18% | genre |
| FB Books | 13,117 | 1,398 | 2,234 | 99.58% | genre, author |
Table 1: Dataset statistics after pre-processing.
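As a rough illustration of the preprocessing step described above, the following sketch applies iterative 10-core filtering to a table of (user, item) interactions; it is a simplified stand-in, not the actual pipeline used by the authors.

```python
# Illustrative iterative k-core filtering (k = 10): repeatedly drop users and
# items with fewer than k interactions until the dataset no longer changes.
# A sketch only, not the framework's actual preprocessing code.
import pandas as pd

def k_core_filter(df: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    """df has columns 'user' and 'item', one row per interaction."""
    while True:
        user_counts = df["user"].value_counts()
        item_counts = df["item"].value_counts()
        keep_users = user_counts[user_counts >= k].index
        keep_items = item_counts[item_counts >= k].index
        filtered = df[df["user"].isin(keep_users) & df["item"].isin(keep_items)]
        if len(filtered) == len(df):  # converged: nothing removed in this pass
            return filtered
        df = filtered
```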
Baseline Models. To measure the effectiveness of ChatGPT, we
experimentally compare its performance with state-of-the-art baselines
from three categories: Non-Personalized, Collaborative Filtering, and
Content-Based Filtering methods. To ensure a fair comparison, we train
the baselines and optimize their hyperparameters using the Elliot
framework , and split the dataset into 80% training and 20% test sets,
following the all unrated items evaluation protocol . The code used for
the experiments is publicly available at:
https://github.com/sisinflab/beyond-accuracy-recsys-chatgpt
. Below, we
describe the baselines, grouped by recommendation category.
Non-Personalized. Random and Most Popular return random recommendations and the most popular recommendations, respectively, and are used as reference points.

Collaborative Filtering. To compare the effect of ChatGPT recommendations on beyond-accuracy metrics, we selected the following collaborative filtering methods, each focusing on different aspects. Specifically, we selected $`\textrm{RP}^3_{\beta}`$ and LightGCN for their demonstrated ability to maintain accuracy while preserving diversity. ItemKNN, UserKNN, and EASE$`^R`$ were chosen for their emphasis on relevance and personalization. Finally, MF2020 and NeuMF were included as a tradeoff between model complexity and effectiveness.

Content-Based Filtering. We further extend our comparison by including content-based models, which prioritize explicit feature representations and offer a meaningful contrast to collaborative models. This allows us to evaluate ChatGPT against the most appropriate model type for the dataset. Specifically, we include VSM, which represents items as vectors in a high-dimensional space, with each dimension corresponding to a feature, as well as AttributeItemKNN and AttributeUserKNN, which rely on TF-IDF-weighted attribute vectors to compute similarities and generate recommendations.
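As a rough picture of the content-based ingredients mentioned above, the sketch below builds TF-IDF vectors from item attributes and derives an item-item cosine-similarity matrix; the toy catalog and the code are illustrative assumptions, not the Elliot implementations of AttributeItemKNN or AttributeUserKNN.

```python
# Rough illustration of the content-based ingredients behind AttributeItemKNN:
# items are represented by TF-IDF-weighted attribute vectors (e.g., genres),
# and similarities come from cosine similarity. Toy data, not the real catalog.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

item_attributes = {              # hypothetical toy catalog
    "Heat": "crime thriller",
    "GoldenEye": "action thriller",
    "Toy Story": "animation comedy family",
}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(item_attributes.values())
similarity = cosine_similarity(matrix)   # item-item similarity matrix

items = list(item_attributes)
print(f"sim(Heat, GoldenEye) = {similarity[items.index('Heat'), items.index('GoldenEye')]:.2f}")
```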
Ensuring Recommendation Consistency. ChatGPT models generate
recommendations based solely on the user profile provided in the prompt,
without being constrained to a predefined dataset. As a result, they may
hallucinate or suggest real items not present in the reference dataset,
leading to discrepancies in item names and inconsistencies in
evaluation.
To address this, we adopt a post-processing pipeline that uses Gestalt pattern matching to identify the closest match in the dataset, accepting items with a similarity score above 90% (empirically determined). Unmatched items are flagged as External Items, originating from the LLM’s pre-trained knowledge, and excluded from evaluation to ensure a fair comparison with traditional recommenders by selecting in-catalogue items.
Since this final step could affect our evaluations, we verified that out-of-catalogue items consistently appeared beyond the top-10 positions in all recommendation lists, ensuring that rank-sensitive metrics remain unaffected and preserving the validity of our evaluation. In our configuration, ChatGPT placed these items only after rank 23, suggesting 2,740 out-of-catalogue items for Books, 870 for Music, and 234 for Movies.
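For reference, Python's difflib.SequenceMatcher implements the Ratcliff/Obershelp (Gestalt) pattern-matching algorithm, so the matching step could look roughly like the sketch below; the helper name and example titles are ours, and the 0.9 threshold mirrors the empirically chosen 90% similarity cutoff.

```python
# Sketch of the title-matching step: difflib.SequenceMatcher implements the
# Ratcliff/Obershelp (Gestalt) pattern-matching algorithm. Titles whose best
# match falls below the 0.9 threshold would be flagged as External Items.
from difflib import SequenceMatcher

def match_to_catalogue(generated_title, catalogue, threshold=0.9):
    best_item, best_score = None, 0.0
    for item in catalogue:
        score = SequenceMatcher(None, generated_title.lower(), item.lower()).ratio()
        if score > best_score:
            best_item, best_score = item, score
    return best_item if best_score >= threshold else None  # None -> External Item

catalogue = ["The Lord of the Rings", "The Hobbit"]
print(match_to_catalogue("The Lord Of The Rings", catalogue))
```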
Finally, to ensure a fair comparison, we evaluate all models and ChatGPT
results at a cutoff of 10 (i.e., Top-10 recommendations per user),
following widely accepted practices in recommendation.
Evaluation Metrics. While our primary focus is on the
beyond-accuracy aspects of ChatGPT’s recommendations, it is also
important to include accuracy metrics to assess whether the
recommendations are relevant to users. For this purpose, we use two
standard metrics: Precision and Recall. Higher values of Precision and
Recall indicate that the recommender system provides a greater number of
relevant items. Additionally, we evaluate the ranking quality of the
recommendations using the Normalized Discounted Cumulative Gain (nDCG),
where higher values indicate better recommendation lists.
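For clarity, a minimal sketch of these accuracy metrics at a fixed cutoff, assuming binary relevance (an item is relevant iff it appears in the user's test set), is shown below; this is a simplified illustration rather than the evaluation framework's implementation.

```python
# Minimal sketch of accuracy metrics at a cutoff k for a single user,
# assuming binary relevance (an item is relevant iff it is in the test set).
import math

def precision_recall_ndcg(recommended, relevant, k=10):
    top_k = recommended[:k]
    hits = [1 if item in relevant else 0 for item in top_k]
    precision = sum(hits) / k
    recall = sum(hits) / len(relevant) if relevant else 0.0
    dcg = sum(h / math.log2(rank + 2) for rank, h in enumerate(hits))
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(rank + 2) for rank in range(ideal_hits))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg

# Example usage with a toy recommendation list and test set
print(precision_recall_ndcg(["a", "b", "c"], {"b", "d"}, k=3))
```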
For beyond-accuracy metrics, we selected a set of measures to evaluate diversity, novelty, and popularity bias. The specific metrics considered are detailed in Table [tab:beyond_accuracy_table].
Experimental Results
ChatGPT Beyond-Accuracy Recommendation Performance
In this section, we discuss the empirical findings from
Table [tab:combined_all], focusing on
(RQ1.) the diversity of ChatGPT’s recommendations, (RQ2.) their novelty,
and (RQ3.) the extent to which ChatGPT is affected by popularity bias.
The evaluation comprises three datasets, Facebook Books, Last.FM, and
MovieLens, and compares ChatGPT‑3.5 and ChatGPT‑4 against both
Collaborative Filtering and Content‑Based Filtering baselines.
Statistically significant differences (paired t‑tests at $`p<0.05`$) are
noted where indicated in the table.
Preliminary Accuracy Analysis. Before examining diversity, novelty,
and popularity bias, we first verify that ChatGPT’s recommendations
fulfill the primary goal of offering relevant items. We use nDCG,
Recall, and Precision as standard accuracy metrics. Higher values on
these metrics imply better recommendations.
Overall, ChatGPT demonstrates a comparable level of accuracy in recommendation scenarios. Specifically, on Facebook Books, ChatGPT-4 attains the highest nDCG overall (0.0932), significantly outperforming the best baseline, AttributeItemKNN (0.0479), as well as ChatGPT-3.5 (0.0668). Recall and Precision follow a similar pattern to nDCG.
For Last.FM, while ChatGPT-4 (nDCG = 0.2832) does not surpass the best Collaborative Filtering (CF) approach ($`\textrm{RP}^3_{\beta}`$: 0.3147), it still ranks among the top-performing algorithms. ChatGPT-3.5 trails behind ChatGPT-4 but still outperforms some baselines (e.g., EASE$`^R`$, AttributeItemKNN).
For MovieLens, although ChatGPT-4 improves upon ChatGPT-3.5 across all accuracy metrics, raising nDCG from 0.1475 to 0.1815 and Precision from 0.1120 to 0.1551, certain CF algorithms (e.g., $`\textrm{RP}^3_{\beta}`$: 0.2827 nDCG, 0.2708 Precision) achieve significantly higher scores. Nonetheless, ChatGPT’s accuracy levels comfortably exceed those of some methods, such as VSM (0.0174 nDCG) and AttributeItemKNN (0.0326 nDCG).
These results demonstrate that both ChatGPT-3.5 and ChatGPT-4 achieve
valid and reasonable performance on accuracy metrics. This preliminary
evaluation ensures that the subsequent analysis of diversity, novelty,
and popularity bias is based on recommendations that already meet the
accuracy standard. In the following sections, our analysis is divided
according to the research questions (RQs).
(RQ1.) Are ChatGPT’s recommendations diverse? We assess diversity
using Gini and Item Coverage (ItemCV). A lower Gini indicates a higher
concentration toward certain items, while higher coverage values
indicate that more items from the catalog are recommended.
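A simplified sketch of how such aggregate-diversity measures can be computed from all users' recommendation lists is shown below. Note that it uses the classical Gini inequality coefficient (higher = more concentrated), whereas the paper reads Gini in the opposite direction, so its reported value likely corresponds to a complement of this quantity; that mapping is an assumption on our part.

```python
# Sketch of aggregate-diversity measures over all users' Top-N lists.
# ItemCV counts distinct recommended items; gini() below is the classical
# inequality coefficient over item exposure counts (higher = more concentrated),
# which may differ in direction from the value reported in the paper.
from collections import Counter

def item_coverage(recommendation_lists):
    return len({item for rec_list in recommendation_lists for item in rec_list})

def gini(recommendation_lists, catalogue_size):
    counts = Counter(item for rec_list in recommendation_lists for item in rec_list)
    exposures = sorted(list(counts.values()) + [0] * (catalogue_size - len(counts)))
    n, total = len(exposures), sum(exposures)
    # Classical Gini coefficient over the sorted exposure distribution.
    return sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(exposures)) / (n * total)

lists = [["a", "b"], ["a", "c"], ["a", "b"]]
print(item_coverage(lists), round(gini(lists, catalogue_size=5), 3))
```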
Facebook Books (Table [tab:combined_all]). ChatGPT-4 achieves a Gini of 0.1050 and an ItemCV of 1,004, outperforming ChatGPT-3.5 (Gini = 0.0713, ItemCV = 853) on both metrics. Although several baselines, such as ItemKNN (Gini = 0.5293, ItemCV = 2,141), still yield a better diversity score, both ChatGPT-4 and ChatGPT-3.5 generally rank above baselines such as MostPop and EASE$`^R`$. In terms of item coverage and Gini, ChatGPT models demonstrate a high concentration on specific items while covering nearly half of the total span (1,004 out of 2,234 items).
Last.FM (Table [tab:combined_all]). A similar trend emerges: ChatGPT-4 has a higher Gini (0.2023) than ChatGPT-3.5 (0.1927), indicating a lower concentration of recommendations on specific items. Additionally, GPT-4 covers 944 out of 1,507 items, whereas $`\textrm{RP}^3_{\beta}`$, which is designed to trade off diversity and accuracy, achieves a coverage value of 831 and a Gini of 0.1441, demonstrating ChatGPT’s strong ability to recommend diverse items.
MovieLens (Table [tab:combined_all]). ChatGPT-4
achieves a Gini of 0.0853, a slight improvement over ChatGPT-3.5
(0.0851). However, its item coverage spans 553 out of 1,862 items, which
is comparatively lower than approaches such as
$`\textrm{RP}^3_{\beta}`$(Gini = 0.1230, ItemCV = 744). These results
highlight that, although the diversity score is lower than certain
baselines, ChatGPT still presents a comparable diversity score on this
dataset.
Summary (RQ1). ChatGPT’s recommendations are moderately diverse
for Facebook Books and Last.FM, while exhibiting limited diversity on
MovieLens, with GPT‑4 consistently outperforming GPT‑3.5. Although it
does not match the highest-diversity baselines, it shows superior
diversity compared to some CF and CBF approaches.
(RQ2.) Are ChatGPT’s recommendations novel? Novelty is measured via
EPC (Expected Popularity Complement) and EFD (Expected Free Discovery),
both interpreted such that higher values imply more novel
recommendations.
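A simplified sketch of these novelty metrics, in the spirit of Vargas and Castells but ignoring rank discounts and relevance weighting, is shown below; the exact formulation used in the paper's evaluation framework may differ.

```python
# Simplified sketch of the novelty metrics, ignoring rank discounts and
# relevance weighting for clarity -- an assumption, not the exact formulation
# used in the evaluation framework. EPC rewards items few users have seen;
# EFD uses each item's self-information (-log2 of its interaction share).
import math

def epc(rec_list, item_popularity, num_users):
    # item_popularity[i]: number of users who interacted with item i
    return sum(1 - item_popularity.get(i, 0) / num_users for i in rec_list) / len(rec_list)

def efd(rec_list, item_popularity, total_interactions):
    novelty = 0.0
    for i in rec_list:
        p = item_popularity.get(i, 0) / total_interactions
        novelty += -math.log2(p) if p > 0 else 0.0
    return novelty / len(rec_list)

pop = {"a": 90, "b": 10, "c": 1}
print(round(epc(["b", "c"], pop, num_users=100), 3),
      round(efd(["b", "c"], pop, total_interactions=101), 3))
```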
Facebook Books (Table [tab:combined_all]). ChatGPT‑4 exhibits relatively high novelty (EPC=0.0353, EFD=0.3486), exceeding most baselines, including ChatGPT‑3.5 (EPC=0.0250, EFD=0.2480), and even surpassing all CF and CBF algorithms on these metrics.
Last.FM (Table [tab:combined_all]). Both ChatGPT versions rank above average in EPC and EFD, with CF and CBF methods (e.g., $`\textrm{RP}^3_{\beta}`$: EPC=0.2110, EFD=1.9970, VSM: EPC=0.1593, EFD=1.4845) performing at a comparable level. Still, the difference between ChatGPT‑4 (0.1918 EPC, 1.8663 EFD) and ChatGPT‑3.5 (0.1680 EPC, 1.6436 EFD) suggests GPT‑4 more effectively recommends less mainstream items.
MovieLens (Table [tab:combined_all]). On
MovieLens, ChatGPT-4 (0.1453 EPC, 1.6010 EFD) outperforms ChatGPT-3.5
(0.1260 EPC, 1.3981 EFD) in terms of EPC and EFD values and places it on
par with other methods (e.g., NeuMF: 0.1171 EPC, 1.2767 EFD), although
lower than $`\textrm{RP}^3_{\beta}`$(EPC of 0.2421, EFD of 2.6613), the
best model.
Summary (RQ2). ChatGPT’s recommendations exhibit above-average
novelty in MovieLens and high novelty in Facebook Books and Last.FM,
with GPT-4 generally surpassing GPT-3.5. The results suggest that
ChatGPT, based on the user’s history, also recommends novel items for
each user.
(RQ3.) Is ChatGPT affected by popularity bias? We examine popularity
bias using APLT (Average Percentage of Long-Tail items; higher values indicate
a stronger inclination toward long-tail, i.e., less popular, items) and ARP
(Average Recommendation Popularity; lower values imply less popularity
bias).
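A minimal sketch of these popularity-bias metrics is shown below, assuming item popularity is the raw interaction count and that the long tail is everything outside the most-interacted head of the catalog; the precise head/tail cutoff used in the paper is not stated here and is an assumption.

```python
# Sketch of the popularity-bias metrics, assuming item popularity is the raw
# interaction count and (as commonly done) the long tail is everything outside
# the most-popular head of the catalog -- the exact cutoff is an assumption.
def arp(rec_list, item_popularity):
    # Average popularity of the recommended items (lower = less biased).
    return sum(item_popularity.get(i, 0) for i in rec_list) / len(rec_list)

def aplt(rec_list, long_tail_items):
    # Fraction of recommended items that come from the long tail (higher = better).
    return sum(1 for i in rec_list if i in long_tail_items) / len(rec_list)

pop = {"a": 500, "b": 40, "c": 3}
long_tail = {"b", "c"}  # assumed: items outside the popular head
print(arp(["a", "b", "c"], pop), round(aplt(["a", "b", "c"], long_tail), 2))
```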
Facebook Books (Table [tab:combined_all]). ChatGPT-3.5’s recommendations yield APLT = 0.1870 and ARP = 46, while ChatGPT-4 improves to APLT = 0.2424 and ARP = 40. With a higher APLT and lower ARP, GPT-4 demonstrates a better capability for recommending long-tail and less popular items than GPT-3.5. Although both models remain far from pure MostPop methods (ARP = 138), some baselines, such as AttributeItemKNN (APLT = 0.5879, ARP = 7) and VSM (APLT = 0.5761, ARP = 7), achieve better APLT and ARP values.
Last.FM (Table [tab:combined_all]). ChatGPT-3.5 has an APLT of 0.1391 and an ARP of 99, while GPT-4 has an APLT of 0.1267 and an ARP of 102, positioning it in the mid-range of models. This suggests that GPT-4 covers a smaller percentage of the long tail and tends to recommend more popular items. Although it outperforms certain baselines, such as $`\textrm{RP}^3_{\beta}`$(APLT = 0.0678, ARP = 153), it does not perform as well as other baselines, such as AttributeItemKNN (APLT = 0.3043, ARP = 87.8647).
MovieLens (Table [tab:combined_all]). ChatGPT
shows an ARP of 90 for GPT-3.5 and 95 for GPT-4, which is lower than
MostPop (ARP = 182) but higher than some graph-based methods (e.g.,
LightGCN: ARP = 43) or neighbor-based methods (e.g., AttributeItemKNN:
ARP = 23). This indicates that its behavior is not as popularity-driven
as MostPop but is still influenced by popular items. A similar trend is
observed for APLT, further demonstrating that ChatGPT does not recommend
items from the long tail and exhibits a degree of popularity bias.
Summary (RQ3). Although ChatGPT’s values are far from those
obtained by MostPop, it still exhibits a tendency to recommend popular
items, neglecting items in the long tail. In particular, GPT-4
demonstrates a lower ARP than ChatGPT-3.5, suggesting a tendency to
recommend less popular items.
To conclude, ChatGPT models exhibit strong beyond-accuracy
performance, achieving an optimal balance of novelty and diversity in
the books domain, comparable results in the music domain, and suboptimal
outcomes in the movie domain. Although it shows some inclination toward
popular items, this bias is far less pronounced compared to MostPop or
other strongly popularity-biased baselines. Furthermore, the
improvements observed from GPT-3.5 to GPT-4 across all three datasets
highlight the strength of GPT-4 for recommendations, particularly in
balancing beyond-accuracy trade-offs.
These findings underscore the potential of ChatGPT as a recommender system while also highlighting areas for improvement, particularly in refining its ability to balance relevance, diversity, and novelty across domains.
User Cold-Start Scenario
We now examine user cold‐start recommendations, defined here as
scenarios where each user has provided a maximum of ten interactions.
Table [tab:coldstart_10] details these
results across three datasets, Facebook Books, Last.FM, and MovieLens,
comparing ChatGPT‐3.5 and ChatGPT‐4 to strong Collaborative Filtering
(CF) and Content‐Based Filtering (CBF) baselines. Our central question
is:
RQ4: How effective is ChatGPT in the user cold-start scenario across
accuracy and beyond-accuracy dimensions?
Accuracy under Cold‐Start. Despite limited user interactions, ChatGPT exhibits competitive to superior accuracy compared to traditional baselines. For Facebook Books, GPT-4 achieves higher nDCG (0.0538) and Recall (0.0873) than all baselines, including $`\textrm{RP}^3_{\beta}`$ (nDCG = 0.0346) and AttributeItemKNN (0.0335). GPT-3.5 also surpasses these baselines but is slightly behind GPT-4. For Last.FM, ChatGPT maintains robust performance ($`nDCG\geq0.2791`$, $`Recall\geq0.3423`$), outperforming MostPop (nDCG = 0.0529) and random baselines by a wide margin. Although $`\textrm{RP}^3_{\beta}`$ leads in nDCG (0.2389), GPT-4 often excels in Recall and Precision. For MovieLens, GPT-4 attains the highest nDCG (0.1405), surpassing both CF and CBF baselines, while ChatGPT-3.5 (0.1117) also remains competitive. These results underscore ChatGPT’s capacity to identify relevant items effectively from few interactions.
Beyond‐Accuracy in Cold‐Start.
Diversity. GPT‑4 generally surpasses GPT‑3.5 in Gini and item coverage across all three datasets (e.g., increasing from 0.0538 to 0.0846 in Gini on Facebook Books), indicating that GPT‑4’s recommendations span a broader set of items. Although baselines like $`\textrm{RP}^3_{\beta}`$achieve higher coverage in MovieLens and Facebook Books, GPT-4 performs best on Last.FM.
Novelty. ChatGPT’s EPC and EFD values exceed those of CF and CBF baselines across all datasets (e.g., GPT‑4’s EPC = 0.0186 vs. $`\textrm{RP}^3_{\beta}`$= 0.0115 on Facebook Books), implying a tendency to recommend novel items rather than relying on mainstream items.
Popularity Bias. ChatGPT exhibits a moderate inclination toward
popular items compared to baselines across all datasets. Nonetheless, it
remains far from MostPop (e.g., $`ARP\geq139`$ on Facebook Books) but is
comparable to some baselines (e.g., AttributeUserKNN), indicating room
for further mitigation strategies.
In summary, ChatGPT proves highly effective in cold-start
scenarios by: (i) maintaining strong accuracy despite minimal user
interactions, with GPT-4 often outperforming GPT-3.5; (ii) striking a
balance among diversity, novelty, and popularity bias; (iii)
demonstrating consistent improvements over baselines, underscoring
ChatGPT’s capacity to infer user interests with limited interactions.
Conclusion
In this work, we explore the diversity, novelty, and popularity bias of ChatGPT recommendations. Our findings demonstrate that for the Facebook Books, Last.FM, and MovieLens datasets, ChatGPT models exhibit strong beyond-accuracy performance, achieving an optimal balance of novelty and diversity in Facebook Books, comparable results for Last.FM, and suboptimal outcomes for MovieLens.
Additionally, we show that while ChatGPT demonstrates a good balance between novelty and diversity, it also exhibits a tendency to recommend popular items, especially in the MovieLens dataset.
Finally, we extend our exploration to the user cold-start scenario, where ChatGPT proves highly effective by maintaining strong accuracy despite minimal user interactions, balancing diversity, novelty, and popularity bias, and demonstrating consistent improvements over baselines.
These findings underscore the beyond-accuracy capabilities of ChatGPT as a recommender system. Future research will include additional datasets to generalize the findings across domains, as well as experiments comparing ChatGPT with other LLMs such as Gemini, LLaMA, and DeepSeek.
Limitation
Nowadays, LLMs are used to augment the capabilities of recommender systems. However, these models are typically trained on vast internet-scale corpora, which may include portions of open datasets used for benchmarking. Recent work studying memorization in MovieLens-1M shows that models like GPTs and LLaMA-3 can memorize such datasets, with larger models exhibiting higher memorization rates. For example, the reported memorization rate is 12.9% for LLaMA-3.1 405B and 80.76% for GPT-4. Further research should focus on understanding the correlation between improvements in recommendation quality and memorization capacity.
References

F. Ricci, L. Rokach, B. Shapira (Eds.), Recommender Systems
Handbook, Springer, US, 2022. S. Zhang, L. Yao, A. Sun, Y. Tay, Deep
learning based recommender system: A survey and new perspectives,
Comput. Surv. 52 (2019) 5:1–5:38. J. Chung, Ç. Gülçehre, K. Cho,
Y. Bengio, Empirical evaluation of gated recurrent neural networks on
sequence modeling, CoRR abs/1412.3555 (2014). URL:
http://arxiv.org/abs/1412.3555. arXiv:1412.3555. A. Gu, T. Dao,
Mamba: Linear-time sequence modeling with selective state spaces, CoRR
abs/2312.00752 (2023). URL: https://doi.org/10.48550/arXiv.2312.00752.
doi:10.48550/ARXIV.2312.00752. arXiv:2312.00752. J. Devlin, M. Chang,
K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers
for language understanding, in: NAACL-HLT (1), 2019, pp. 4171–4186.
A. D. Bellis, V. W. Anelli, T. D. Noia, E. D. Sciascio, prompt-based
detection of semantic containment patterns in mlms, in: G. Demartini,
K. Hose, M. Acosta, M. Palmonari, G. Cheng, H. Skaf-Molli, N. Ferranti,
D. Hernández, A. Hogan (Eds.), The Semantic Web - ISWC 2024 - 23rd
International Semantic Web Conference, Baltimore, MD, USA, November
11-15, 2024, Proceedings, Part II, volume 15232 of Lecture Notes in
Computer Science, Springer, 2024, pp. 227–246. URL:
https://doi.org/10.1007/978-3-031-77850-6_13.
doi:10.1007/978-3-031-77850-6_13. G. Servedio, A. De Bellis,
D. Di Palma, V. W. Anelli, T. Di Noia, Are the hidden states hiding
something? testing the limits of factuality-encoding capabilities in
llms, arXiv preprint arXiv:2505.16520 (2025). D. Di Palma, A. De Bellis,
G. Servedio, V. W. Anelli, F. Narducci, T. Di Noia, Llamas have feelings
too: Unveiling sentiment and emotion representations in llama models
through probing, arXiv preprint arXiv:2505.16491 (2025). P. Aghilar,
V. W. Anelli, M. Trizio, E. Di Sciascio, T. Di Noia, Training-free,
identity-preserving image editing for fashion pose alignment and
normalization, Expert Systems with Applications 293 (2025a) 128579.
doi:https://doi.org/10.1016/j.eswa.2025.128579. P. Aghilar, V. W.
Anelli, A. Lops, F. Narducci, A. Ragone, S. Roccotelli, M. Trizio,
Adaptive user modeling in visual merchandising: Balancing brand identity
with operational efficiency, in: Proceedings of the 33rd ACM Conference
on User Modeling, Adaptation and Personalization, UMAP 2025, New York
City, NY, USA, June 16-19, 2025, ACM, 2025b, pp. 358–360. URL:
https://doi.org/10.1145/3699682.3730976. doi:10.1145/3699682.3730976.
Y. Ping, Y. Li, J. Zhu, Beyond accuracy measures: the effect of
diversity, novelty and serendipity in recommender systems on user
engagement, Electronic Commerce Research (2024) 1–28. T. Duricic,
D. Kowald, E. Lacic, E. Lex, Beyond-accuracy: a review on diversity,
serendipity, and fairness in recommender systems based on graph neural
networks, Frontiers Big Data 6 (2024). S. Karimi, H. A. Rahmani,
M. Naghiaei, L. Safari, Provider fairness and beyond-accuracy trade-offs
in recommender systems, CoRR abs/2309.04250 (2023). M. Attimonelli,
A. D. Bellis, C. Pomo, D. Jannach, E. D. Sciascio, T. D. Noia, Do we
really need specialization? evaluating generalist text embeddings for
zero-shot recommendation and search, in: RecSys, ACM, 2025. URL:
https://doi.org/10.1145/3705328.3748040. doi:10.1145/3705328.3748040.
D. Di Palma, G. Servedio, V. W. Anelli, G. M. Biancofiore, F. Narducci,
L. Carnimeo, T. D. Noia, Beyond words: Can chatgpt support
state-of-the-art recommender systems?, in: IIR, volume 3802 of CEUR
Workshop Proceedings, CEUR-WS.org, 2024, pp. 13–22. M. Valentini,
Cooperative and competitive llm-based multi-agent systems for
recommendation, in: C. Hauff, C. Macdonald, D. Jannach, G. Kazai, F. M.
Nardini, F. Pinelli, F. Silvestri, N. Tonellotto (Eds.), Advances in
Information Retrieval - 47th European Conference on Information
Retrieval, ECIR 2025, Lucca, Italy, April 6-10, 2025, Proceedings, Part
V, volume 15576 of Lecture Notes in Computer Science, Springer, 2025,
pp. 204–211. Y. Hou, J. Zhang, Z. Lin, H. Lu, R. Xie, J. J. McAuley,
W. X. Zhao, Large language models are zero-shot rankers for recommender
systems, in: ECIR (2), volume 14609 of Lecture Notes in Computer
Science, Springer, 2024, pp. 364–381. D. Di Palma, Retrieval-augmented
recommender system: Enhancing recommender systems with large language
models, in: RecSys, ACM, 2023, pp. 1369–1373. S. Dai, N. Shao, H. Zhao,
W. Yu, Z. Si, C. Xu, Z. Sun, X. Zhang, J. Xu, Uncovering chatgpt’s
capabilities in recommender systems, in: RecSys, ACM, 2023, pp.
1126–1132. M. Attimonelli, D. Danese, D. Malitesta, C. Pomo, G. Gassi,
T. D. Noia, Ducho 2.0: Towards a more up-to-date unified framework for
the extraction of multimodal features in recommendation, in: WWW
(Companion Volume), ACM, 2024, pp. 1075–1078. J. Liu, C. Liu, R. Lv,
K. Zhou, Y. Zhang, Is chatgpt a good recommender? A preliminary study,
CoRR abs/2304.10149 (2023). D. Carraro, D. Bridge, Enhancing
recommendation diversity by re-ranking with large language models, ACM
Trans. Recomm. Syst. (2024). URL: https://doi.org/10.1145/3700604.
doi:10.1145/3700604. Y. Tokutake, K. Okamoto, Can large language models
assess serendipity in recommender systems?, J. Adv. Comput. Intell.
Intell. Informatics 28 (2024) 1263–1272. D. Di Palma, G. M. Biancofiore,
V. W. Anelli, F. Narducci, T. D. Noia, Content-based or collaborative?
insights from inter-list similarity analysis of chatgpt recommendations,
in: UMAP (Adjunct Publication), ACM, 2025, pp. 28–33. Y. Deldjoo,
Understanding biases in chatgpt-based recommender systems: Provider
fairness, temporal stability, and recency, ACM Trans. Recomm. Syst.
(2024). URL: https://doi.org/10.1145/3690655. doi:10.1145/3690655.
J. Zhang, K. Bao, Y. Zhang, W. Wang, F. Feng, X. He, Is chatgpt fair for
recommendation? evaluating fairness in large language model
recommendation, in: RecSys, 2023, pp. 993–999. A. C. M. Mancino,
A. Ferrara, S. Bufi, D. Malitesta, T. D. Noia, E. D. Sciascio, Kgtore:
Tailored recommendations through knowledge-aware GNN models, in: RecSys,
2023, pp. 576–587. S. Bufi, A. C. M. Mancino, A. Ferrara, D. Malitesta,
T. D. Noia, E. D. Sciascio, KGUF: simple knowledge-aware graph-based
recommender with user-based semantic features filtering, in: IRonGraphs,
volume 2197 of Communications in Computer and Information Science,
Springer, 2024, pp. 41–59. F. M. Harper, J. A. Konstan, The movielens
datasets: History and context, Trans. Interact. Intell. Syst. 5 (2016)
19:1–19:19. I. Cantador, P. Brusilovsky, T. Kuflik, Second workshop on
information heterogeneity and fusion in recommender systems
(hetrec2011), in: RecSys, ACM, New York, NY, USA, 2011, pp. 387–388.
G. M. Biancofiore, D. Di Palma, C. Pomo, F. Narducci, T. Di Noia,
Conversational user interfaces and agents, in: Human-Centered AI: An
Illustrated Scientific Quest, Springer, 2025, pp. 399–438. J. L.
Herlocker, J. A. Konstan, L. G. Terveen, J. Riedl, Evaluating
collaborative filtering recommender systems, Trans. Inf. Syst. 22 (2004)
5–53. T. Silveira, M. Zhang, X. Lin, Y. Liu, S. Ma, How good your
recommender system is? A survey on evaluations in recommendation, Int.
J. Mach. Learn. Cybern. 10 (2019) 813–831. M. Karimi, D. Jannach,
M. Jugovac, News recommender systems - survey and roads ahead, Inf.
Process. Manag. 54 (2018) 1203–1227. A. Gunawardana, G. Shani, S. Yogev,
Evaluating recommender systems, in: Recommender Systems Handbook,
Springer, US, 2022, pp. 547–601. M. Kaminskas, D. Bridge, Diversity,
serendipity, novelty, and coverage: A survey and empirical analysis of
beyond-accuracy objectives in recommender systems, Trans. Interact.
Intell. Syst. 7 (2017) 2:1–2:42. P. Cheng, S. Wang, J. Ma, J. Sun,
H. Xiong, Learning to recommend accurate and diverse items, in: WWW,
ACM, 2017, pp. 183–192. W. Wu, L. Chen, Y. Zhao, Personalizing
recommendation diversity based on user personality, User Model. User
Adapt. Interact. 28 (2018) 237–276. M. Nakatsuji, Y. Fujiwara,
A. Tanaka, T. Uchiyama, K. Fujimura, T. Ishida, Classical music for rock
fans?: novel recommendations for expanding user interests, in: CIKM,
ACM, 2010, pp. 949–958. M. Cai, L. Chen, Y. Wang, H. Bai, P. Sun, L. Wu,
M. Zhang, M. Wang, Popularity-aware alignment and contrast for
mitigating popularity bias, in: KDD, ACM, 2024, pp. 187–198.
V. Paparella, D. Di Palma, V. W. Anelli, T. D. Noia, Broadening the
scope: Evaluating the potential of recommender systems beyond
prioritizing accuracy, in: RecSys, ACM, 2023, pp. 1139–1145. D. Jannach,
L. Lerche, I. Kamehkhosh, M. Jugovac, What recommenders recommend: an
analysis of recommendation biases and possible countermeasures, User
Model. User Adapt. Interact. 25 (2015) 427–491. G. Adomavicius,
J. Zhang, Impact of data characteristics on recommender systems
performance, Trans. Manag. Inf. Syst. 3 (2012) 3:1–3:17. S. Vargas,
P. Castells, Rank and relevance in novelty and diversity metrics for
recommender systems, in: RecSys, ACM, 2011, pp. 109–116.
H. Abdollahpouri, R. Burke, B. Mobasher, Managing popularity bias in
recommender systems with personalized re-ranking, in: FLAIRS, AAAI
Press, 2019, pp. 413–418. Y. Gao, T. Sheng, Y. Xiang, Y. Xiong, H. Wang,
J. Zhang, Chat-rec: Towards interactive and explainable llms-augmented
recommender system, CoRR abs/2303.14524 (2023). A. Manzoor, S. C.
Ziegler, K. M. P. Garcia, D. Jannach, Chatgpt as a conversational
recommender system: A user-centric analysis, in: UMAP, ACM, 2024, pp.
267–272. S. Sanner, K. Balog, F. Radlinski, B. Wedin, L. Dixon, Large
language models are competitive near cold-start recommenders for
language- and item-based preferences, in: RecSys, 2023, pp. 890–896.
Z. Li, Y. Chen, X. Zhang, X. Liang, Bookgpt: A general framework for
book recommendation empowered by large language model, Electronics 12
(2023) 4654. T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan,
P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal,
A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M.
Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin,
S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford,
I. Sutskever, D. Amodei, Language models are few-shot learners, in:
NeurIPS, 2020. A. Kong, S. Zhao, H. Chen, Q. Li, Y. Qin, R. Sun,
X. Zhou, E. Wang, X. Dong, Better zero-shot reasoning with role-play
prompting, in: NAACL-HLT, Association for Computational Linguistics,
2024, pp. 4099–4113. T. Kojima, S. S. Gu, M. Reid, Y. Matsuo,
Y. Iwasawa, Large language models are zero-shot reasoners, CoRR
abs/2205.11916 (2022). J. Wei, X. Wang, D. Schuurmans, M. Bosma,
B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, Chain-of-thought
prompting elicits reasoning in large language models, in: NeurIPS, 2022.
S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, K. Narasimhan,
Tree of thoughts: Deliberate problem solving with large language models,
in: NeurIPS, 2023. N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan,
S. Yao, Reflexion: language agents with verbal reinforcement learning,
in: NeurIPS, 2023. Z. Liu, X. Yu, Y. Fang, X. Zhang, Graphprompt:
Unifying pre-training and downstream tasks for graph neural networks,
in: WWW, ACM, 2023, pp. 417–428. P. Sahoo, A. K. Singh, S. Saha,
V. Jain, S. Mondal, A. Chadha, A systematic survey of prompt engineering
in large language models: Techniques and applications, CoRR
abs/2402.07927 (2024). L. Xu, J. Zhang, B. Li, J. Wang, M. Cai, W. X.
Zhao, J. Wen, Prompting large language models for recommender systems: A
comprehensive framework and empirical analysis, CoRR abs/2401.04997
(2024). A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever,
et al., Language models are unsupervised multitask learners, OpenAI blog
1 (2019) 9. J. Jin, X. Chen, F. Ye, M. Yang, Y. Feng, W. Zhang, Y. Yu,
J. Wang, Lending interaction wings to recommender systems with
conversational agents, in: NeurIPS, 2023. A. Kong, S. Zhao, H. Chen,
Q. Li, Y. Qin, R. Sun, X. Zhou, Better zero-shot reasoning with
role-play prompting, CoRR abs/2308.07702 (2023). A. C. M. Mancino,
S. Bufi, A. di Fazio, A. Ferrara, D. Malitesta, C. Pomo, T. D. Noia,
Datarec: A python library for standardized and reproducible data
management in recommender systems, in: Proceedings of the 48th
International ACM SIGIR Conference on Research and Development in
Information Retrieval, SIGIR 2025, Padua, Italy July 13-18, 2025, ACM,
2025. URL: https://doi.org/10.1145/3726302.3730320.
doi:10.1145/3726302.3730320. V. Paparella, V. W. Anelli, F. M. Nardini,
R. Perego, T. D. Noia, Post-hoc selection of pareto-optimal solutions in
search and recommendation, in: CIKM, ACM, 2023, pp. 2013–2023. V. W.
Anelli, A. Bellogı́n, A. Ferrara, D. Malitesta, F. A. Merra, C. Pomo,
F. M. Donini, T. D. Noia, Elliot: A comprehensive and rigorous framework
for reproducible recommender systems evaluation, in: SIGIR, ACM, New
York, NY, USA, 2021, pp. 2405–2414. A. Ferrara, V. W. Anelli, A. C. M.
Mancino, T. D. Noia, E. D. Sciascio, Kgflex: Efficient recommendation
with sparse feature factorization and knowledge graphs, ACM Trans.
Recomm. Syst. (2023). U. Javed, K. Shaukat, I. A. Hameed, F. Iqbal,
T. M. Alam, S. Luo, A review of content-based and context-based
recommendation systems, International Journal of Emerging Technologies
in Learning (iJET) 16 (2021) 274–306. B. Paudel, F. Christoffel,
C. Newell, A. Bernstein, Updatable, accurate, diverse, and scalable
recommendations for interactive applications, Trans. Interact. Intell.
Syst. 7 (2017) 1:1–1:34. X. He, K. Deng, X. Wang, Y. Li, Y. Zhang,
M. Wang, Lightgcn: Simplifying and powering graph convolution network
for recommendation, in: SIGIR, 2020, pp. 639–648. C. Cooper, S. Lee,
T. Radzik, Y. Siantos, Random walks in recommender systems: exact
computation and simulations, in: WWW (Companion Volume), 2014, pp.
811–816. B. M. Sarwar, G. Karypis, J. A. Konstan, J. Riedl, Analysis of
recommendation algorithms for e-commerce, in: EC, ACM, New York, NY,
USA, 2000, pp. 158–167. J. S. Breese, D. Heckerman, C. M. Kadie,
Empirical analysis of predictive algorithms for collaborative filtering,
in: UAI, 1998, pp. 43–52. H. Steck, Embarrassingly shallow autoencoders
for sparse data, in: WWW, ACM, New York, NY, USA, 2019, pp. 3251–3257.
S. Rendle, W. Krichene, L. Zhang, J. R. Anderson, Neural collaborative
filtering vs. matrix factorization revisited, in: RecSys, 2020, pp.
240–248. X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T. Chua, Neural
collaborative filtering, in: WWW, 2017, pp. 173–182. G. Salton, A. Wong,
C. Yang, A vector space model for automatic indexing, Commun. ACM 18
(1975) 613–620. Z. Gantner, S. Rendle, C. Freudenthaler,
L. Schmidt-Thieme, Mymedialite: a free recommender system library, in:
RecSys, ACM, New York, NY, USA, 2011, pp. 305–308. Y. Chen, Q. Fu,
Y. Yuan, Z. Wen, G. Fan, D. Liu, D. Zhang, Z. Li, Y. Xiao, Hallucination
detection: Robustly discerning reliable answers in large language
models, in: CIKM, 2023, pp. 245–255. F. Nie, J. Yao, J. Wang, R. Pan,
C. Lin, A simple recipe towards reducing hallucination in neural surface
realisation, in: ACL (1), 2019, pp. 2673–2679. Z. Ji, N. Lee,
R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, P. Fung,
Survey of hallucination in natural language generation, Comput. Surv. 55
(2023) 248:1–248:38. V. E. Giuliano, P. E. J. Jr., G. E. Kimball, R. F.
Meyer, B. A. Stein, Automatic pattern recognition by a gestalt method,
Inf. Control. 4 (1961) 332–345. A. V. Petrov, C. MacDonald, gsasrec:
Reducing overconfidence in sequential recommendation trained with
negative sampling, in: RecSys, 2023, pp. 116–128. D. L. Olson, D. Delen,
Advanced Data Mining Techniques, Springer, US, 2008. K. Järvelin,
J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, Trans.
Inf. Syst. 20 (2002) 422–446. D. Di Palma, F. A. Merra, M. Sfilio, V. W.
Anelli, F. Narducci, T. Di Noia, Do llms memorize recommendation
datasets? a preliminary study on movielens-1m, in: Proceedings of the
48th International ACM SIGIR Conference on Research and Development in
Information Retrieval, SIGIR 2025, Padua, Italy July 13-18, 2025, ACM,
2025. URL: https://doi.org/10.1145/3726302.3730178.
doi:10.1145/3726302.3730178.
A Note of Gratitude
The copyright of this content belongs to the respective researchers. We deeply appreciate their hard work and contribution to the advancement of human civilization.
1. While we acknowledge that diversity can be measured in multiple ways, such as Top-N diversity (user-level variation in recommendation lists) or temporal diversity (diversity over time), we focus on aggregate diversity due to its measurable implications for item exposure and long-tail promotion. ↩︎
2. https://2015.eswc-conferences.org/program/semwebeval.html ↩︎