Diffusion Models in Recommendation Systems: A Survey


Recommender systems remain an essential topic due to their wide applications and business potential. Given the strong generative capability that diffusion models have recently exhibited in computer vision, many recommender systems have adopted diffusion models and observed performance improvements on various tasks. Research in this domain has been growing rapidly, calling for a systematic survey. In this survey paper, we propose a taxonomy based on three orthogonal axes to categorize recommender systems that utilize diffusion models. Distinct from a prior survey that categorizes by the role of the diffusion model, we categorize by the recommendation task at hand. This decision follows from the rationale that the adoption of diffusion models ultimately serves to enhance recommendation performance, not the reverse: the recommendation task should not be reshaped merely to accommodate diffusion models. We thus offer a unique perspective on diffusion models in recommender systems, complementary to existing surveys. We present the foundational algorithms of diffusion models and their applications in recommender systems to summarize the rapid development of this field. Finally, we discuss open research directions to prepare and encourage further efforts to advance the field. We compile the relevant papers in a public GitHub repository.


💡 Research Summary

This survey provides a comprehensive overview of the rapidly growing body of work that applies diffusion models to recommender systems. The authors begin by motivating the need for generative approaches in recommendation: traditional collaborative filtering suffers from severe data sparsity, while generative adversarial networks (GANs) are plagued by training instability and mode collapse. Diffusion models—originally introduced for image synthesis through score‑matching and denoising diffusion probabilistic models (DDPM)—offer a stable training paradigm, explicit likelihood‑free sampling, and inherent diversity thanks to their stochastic denoising process.

To capture the breadth of the field, the authors collected 188 relevant papers (through systematic Google Scholar queries and manual reference checks) and organized them along three orthogonal axes: (1) Core Recommendation Tasks, (2) Data Modality & Domain, and (3) Trustworthy Objectives & Constraints. This three‑axis taxonomy distinguishes their work from prior surveys that categorize by the role of the diffusion model (e.g., data augmentation, model backbone, or content generation).

Axis 1 – Core Recommendation Tasks
The survey splits recommendation into collaborative filtering (CF) and sequential recommendation. Within CF, it further distinguishes settings that use implicit feedback, explicit ratings, item‑graph side information, or user‑graph side information. For sequential recommendation, point‑of‑interest (POI) recommendation is treated as a representative case, and the authors identify three ways the diffusion process can interact with a user’s interaction sequence: (i) the sequence is both the diffusion target and guidance, (ii) the sequence is only the diffusion target, and (iii) the sequence serves solely as guidance. Representative works such as DiRec (graph‑aware diffusion), DiSeq (sequence‑as‑target diffusion), and DiGuide (sequence‑as‑guidance diffusion) are discussed, highlighting how noise is injected into graph or sequence representations and later removed to recover latent user‑item affinities.

Axis 2 – Data Modality & Domain
Here the survey maps diffusion‑based recommenders onto the nature of the data they handle. Four major categories emerge: (a) image generation – where user‑uploaded photos are denoised or augmented to improve visual recommendation; (b) multimodal recommendation – where diffusion operates in a latent space that jointly encodes text, image, and audio signals; (c) cross‑domain recommendation – where a diffusion model trained on one domain (e.g., movies) is transferred to another (e.g., books) via latent alignment; and (d) text‑to‑recommendation – where large language models (LLMs) generate textual prompts that are fed into a diffusion pipeline to produce item embeddings directly from user queries. The authors emphasize that the conditional generation capabilities of diffusion models (classifier‑guided, classifier‑free guidance, prompt‑based conditioning) make them especially suitable for these heterogeneous scenarios.
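The classifier-free guidance mentioned above combines two noise predictions from the same network, one computed with the conditioning signal (e.g., a text prompt or user query embedding) and one without it. A minimal sketch of that combination rule, using random vectors in place of real network outputs:

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, w):
    # Classifier-free guidance: start from the unconditional prediction
    # and extrapolate toward the conditional one by guidance scale w.
    # w = 0 ignores the condition, w = 1 recovers the plain conditional
    # prediction, and w > 1 amplifies the conditioning signal.
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_c = rng.standard_normal(4)   # stand-in for the conditional prediction
eps_u = rng.standard_normal(4)   # stand-in for the unconditional prediction
guided = cfg_noise(eps_c, eps_u, w=2.0)
```

In a real recommender, `eps_c` and `eps_u` would come from the denoising network evaluated with and without the user-side condition; the guidance scale `w` trades recommendation fidelity to the condition against diversity.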

Axis 3 – Trustworthy Objectives & Constraints
The third axis captures broader system‑level concerns that have become central to modern recommender design: fairness, accountability, transparency, and out‑of‑distribution (OOD) robustness. Because diffusion models generate a distribution of candidate items rather than a single deterministic output, they naturally mitigate mode collapse and can be tuned to produce balanced recommendations across demographic groups. The iterative denoising steps also provide a traceable chain of computations, enabling post‑hoc explanations of why a particular item was sampled. For OOD robustness, the survey notes that diffusion models can continue sampling from unseen noise levels, offering graceful degradation when confronted with novel items or domains.
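One simple way to quantify "balanced recommendations across demographic groups" is to measure how evenly recommendation slots are distributed over item groups. The sketch below (an illustrative exposure-gap measure, not a metric prescribed by the survey) uses hypothetical item IDs and group labels:

```python
from collections import Counter

def exposure_gap(recommended_items, group_of):
    # Share of recommendation slots received by each item group; the
    # difference between the best- and worst-served group is a simple
    # demographic-parity-style fairness signal (0 = perfectly even).
    # Groups with zero exposure are absent from the count, so this is
    # only a rough proxy.
    counts = Counter(group_of[item] for item in recommended_items)
    total = sum(counts.values())
    shares = [c / total for c in counts.values()]
    return max(shares) - min(shares)

group_of = {"i1": "A", "i2": "A", "i3": "B", "i4": "B"}  # hypothetical catalog
balanced = exposure_gap(["i1", "i3", "i2", "i4"], group_of)  # even split -> 0.0
skewed = exposure_gap(["i1", "i2", "i1", "i3"], group_of)    # 3 A's, 1 B -> 0.5
```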

Foundations of Diffusion Models
The paper dedicates a full section to the mathematical underpinnings of diffusion models. It revisits the original NCSN framework, which learns the score function ∇ₓ log p(x) via score‑matching across a geometric schedule of Gaussian noise levels, and samples using annealed Langevin dynamics. It then details DDPM, which defines a forward Markov chain q(xₜ|xₜ₋₁) with a fixed β‑schedule and learns a reverse denoising network (typically a U‑Net) to predict the added noise. The authors also discuss efficiency‑enhancing variants such as DDIM (deterministic sampling), latent diffusion models (operating in compressed latent spaces), and ODE‑based solvers that reduce the number of inference steps dramatically. Conditional generation techniques—including classifier‑guided diffusion, classifier‑free guidance, and prompt‑conditioning—are explained with equations and practical implementation notes.
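The DDPM forward chain described above admits a closed-form sample at any step t, which is what makes training tractable: xₜ = √ᾱₜ·x₀ + √(1−ᾱₜ)·ε with ε ~ N(0, I). A minimal NumPy sketch, using a toy 1-D embedding in place of an image and the standard linear β-schedule:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # fixed linear beta-schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    # Closed-form forward sample: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps.
    # Training minimizes the MSE between eps and the network's prediction
    # eps_theta(x_t, t); the network itself is omitted from this sketch.
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)         # toy clean "interaction embedding"
xt, eps = q_sample(x0, t=T - 1, rng=rng)
# At t = T-1 the cumulative alpha_bar is tiny, so x_T is essentially
# pure Gaussian noise: the signal has been fully diffused away.
```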

Datasets & Evaluation
A comprehensive catalog of datasets is provided, ranging from classic recommendation benchmarks (MovieLens, Amazon Reviews, Yelp, Gowalla, Last.fm) to multimodal corpora (MS‑COCO, Flickr30K, VQA) used in cross‑modal recommendation studies. For each dataset the survey lists the associated task (CF, sequential, POI, multimodal), the modality, and typical evaluation metrics (NDCG, Recall@K, Hit Rate, fairness metrics such as demographic parity, and OOD detection scores).
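The ranking metrics listed above are straightforward to compute from a model's top-k list. A self-contained sketch of binary-relevance Recall@K and NDCG@K on a toy example:

```python
import math

def recall_at_k(ranked, relevant, k):
    # Fraction of the user's relevant items that appear in the top-k.
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    # Binary-relevance NDCG: log-discounted gain of hits in the top-k,
    # normalized by the gain of an ideal ranking.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg

ranked = ["a", "b", "c", "d"]   # model's ranked list for one user
relevant = {"a", "c"}           # held-out ground truth
r2 = recall_at_k(ranked, relevant, 2)   # 0.5: only "a" is in the top-2
n2 = ndcg_at_k(ranked, relevant, 2)     # < 1: the hit is at rank 1, but "c" is missed
```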

Open Research Directions
The authors identify four high‑impact avenues for future work:

  1. Scalable Sampling for Real‑Time Systems – Developing lightweight samplers (e.g., fewer diffusion steps, distillation, or hybrid diffusion‑GAN approaches) that meet latency constraints of production recommender pipelines.
  2. Privacy‑Preserving Conditional Diffusion – Designing mechanisms that allow conditioning on user‑specific signals while guaranteeing differential privacy, thereby reconciling personalization with regulatory requirements.
  3. Quantitative Trustworthiness Benchmarks – Constructing standardized testbeds that jointly evaluate recommendation accuracy, fairness, transparency, and OOD robustness, enabling fair comparison across methods.
  4. Meta‑Learning for Cross‑Domain Transfer – Leveraging meta‑learning or continual learning to adapt a pre‑trained diffusion model to new domains or modalities with minimal data, addressing the cold‑start problem at scale.
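Direction 1 above builds largely on deterministic samplers like DDIM, which jump between non-adjacent noise levels instead of stepping through all T of them. A sketch of a single DDIM (η = 0) update, idealized by using the true noise as the network's prediction so the effect of a large jump is visible:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def ddim_step(xt, eps_pred, t, s):
    # Deterministic DDIM (eta = 0) jump from step t to an earlier step s:
    # invert the forward formula to estimate x0, then re-noise that
    # estimate at level s. Large jumps (t >> s) are what make few-step
    # sampling fast enough for latency-constrained recommenders.
    x0_pred = (xt - np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alpha_bars[t])
    return np.sqrt(alpha_bars[s]) * x0_pred + np.sqrt(1.0 - alpha_bars[s]) * eps_pred

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
t = T - 1
xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
# With the exact noise as the "prediction", one jump from t = 999 to
# s = 0 lands very close to the clean x0; a trained network only
# approximates this, which is why practical samplers use a handful of
# intermediate steps rather than a single jump.
x_back = ddim_step(xt, eps, t=t, s=0)
```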

Conclusion
By framing diffusion‑based recommendation research through the lenses of task, modality, and trustworthiness, this survey offers a clear roadmap for both newcomers and seasoned researchers. It demonstrates that diffusion models not only overcome the instability and mode‑collapse issues of GANs but also provide a flexible conditional generation framework that can be tailored to diverse recommendation scenarios—from pure collaborative filtering to multimodal, cross‑domain, and fairness‑aware systems. The extensive taxonomy, dataset inventory, and forward‑looking research agenda together constitute a valuable reference for advancing the state of the art in generative recommendation technologies.

