Generative Artificial Intelligence (AI), such as large language models (LLMs), has become a transformative force across science, industry, and society. As these systems grow in popularity, AI-generated material becomes increasingly interwoven with web data, and it becomes increasingly difficult to separate it from human-generated content. Because generative models are updated regularly, later models will inevitably be trained on mixtures of human-generated data and AI-generated data from earlier versions, creating a recursive training process with data contamination. Existing theoretical work has examined only highly simplified settings in which both the real data and the generative model are discrete or Gaussian, and has shown that such recursive training leads to model collapse. However, real data distributions are far more complex, and modern generative models are far more flexible than Gaussian and linear mechanisms. To fill this gap, we study recursive training in a general framework with minimal assumptions on the real data distribution, allowing the underlying generative model to be a general universal approximator. In this framework, we show that contaminated recursive training still converges, with a convergence rate equal to the minimum of the baseline model's convergence rate and the fraction of real data used in each iteration. To the best of our knowledge, this is the first (positive) theoretical result on recursive training without distributional assumptions on the data. We further extend the analysis to settings where sampling bias is present in data collection, and we support all theoretical results with empirical studies.
Recent analyses suggest that Artificial Intelligence (AI) generated text, images, and code constitute an increasingly large share of online content. Journalistic investigations have documented widespread use of AI-generated text across platforms such as Wikipedia.

Figure 1: Horizontal axis: convergence rate of the baseline generative model; vertical axis: fraction of real data. The color indicates which quantity controls the overall convergence rate: red corresponds to the regime in which the rate is limited by the real-data fraction, blue corresponds to the regime in which the rate is limited by the baseline rate, and the diagonal line marks the phase transition between these two regimes.
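To make the phase diagram in Figure 1 concrete, the following minimal sketch (our own illustration, not code from the paper) takes the stated min-of-the-two relationship at face value and classifies a (baseline rate, real-data fraction) pair by which quantity limits the overall convergence rate.

```python
# Minimal sketch (our own illustration, not code from the paper): classify which
# quantity limits the overall convergence rate, assuming the overall rate equals
# the minimum of the baseline rate and the real-data fraction, as stated above.

def limiting_regime(baseline_rate: float, real_fraction: float) -> str:
    """Return the Figure 1 regime for a (baseline rate, real-data fraction) pair."""
    if real_fraction < baseline_rate:
        return "limited by the real-data fraction (red region)"
    if real_fraction > baseline_rate:
        return "limited by the baseline rate (blue region)"
    return "on the phase-transition diagonal"

for rate, frac in [(0.5, 0.1), (0.1, 0.5), (0.3, 0.3)]:
    print(f"baseline={rate}, fraction={frac}: overall rate {min(rate, frac)}, "
          f"{limiting_regime(rate, frac)}")
```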
In a related fashion, generative models trained on biased and non-representative datasets are known to produce significant biases in their generated content (Mehrabi et al., 2021; Zhou et al., 2024). The poor representation and bias in common large datasets used in AI training have been well documented, as have the downstream effects on popular LLMs such as GPT (Lucy and Bamman, 2021; Sheng et al., 2019). The dangers of this are profound, as content generated by these models can further perpetuate the biases they were trained on, leading to unfair and unjust decisions in critical areas such as justice, welfare, and employment (Mehrabi et al., 2021). An extensive line of work has thus examined the consequences of training on biased datasets (Zhou et al., 2024; Cross et al., 2024). Although numerous methods have been investigated to mitigate sampling bias (He and Garcia, 2009; Cortes and Mohri, 2014; Chen et al., 2023b), a fundamental question remains unresolved: if a generative model is trained on a biased dataset, how effective are subsequent improved sampling strategies or bias-correction methods in guiding the model toward the true distribution of interest?
To answer this question, we extend our analysis to the recursive contaminated setting in which the real data are drawn from biased sampling distributions. We derive conditions under which the generator either converges to the biased distribution and fails to remove the bias, converges to the true distribution at a slower rate, or converges to the true distribution at the same rate as in the unbiased setting. In particular, if the bias in the real data is not corrected, the generator simply learns the biased distribution and does not recover the true one. If the practitioner reduces the bias so that the real sampling distribution converges to the true distribution fast enough, then the generator converges at the standard rate. If the bias decays too slowly, the generator's convergence rate is instead limited by the decay rate of the bias. This shows that even when the initial samples are biased, successive improvements in the sampling procedure or other bias-correction methods can still lead later iterations of the model to converge.
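As a concrete illustration of these three regimes, the following toy simulation (our own sketch, not the paper's experimental setup) refits a one-dimensional Gaussian-mean estimator at each iteration on a mixture of biased real samples and synthetic samples from the previous generator, under three hypothetical bias schedules: an uncorrected constant bias, a geometrically decaying bias, and a slowly decaying bias.

```python
# Toy simulation (our own sketch, not the paper's experiments): recursive refitting
# of a Gaussian mean on a mixture of biased real data and synthetic data from the
# previous generator, under three hypothetical bias schedules.
import numpy as np

rng = np.random.default_rng(0)
true_mean, alpha, n, T = 0.0, 0.5, 100_000, 50  # target mean, real fraction, sample size, iterations

def run(bias_at):
    est = 5.0  # initial generator with a badly mis-specified mean
    for t in range(1, T + 1):
        n_real = int(alpha * n)
        real = rng.normal(true_mean + bias_at(t), 1.0, n_real)  # biased real samples
        synth = rng.normal(est, 1.0, n - n_real)                # synthetic samples from previous model
        est = np.concatenate([real, synth]).mean()              # maximum-likelihood refit on the mixture
    return abs(est - true_mean)

schedules = {
    "uncorrected bias (b_t = 1)":      lambda t: 1.0,
    "fast bias decay  (b_t = 0.5**t)": lambda t: 0.5 ** t,
    "slow bias decay  (b_t = 1/t)":    lambda t: 1.0 / t,
}
for name, bias_at in schedules.items():
    # uncorrected bias settles near the biased mean; decaying bias approaches the truth
    print(f"{name}: final |error| = {run(bias_at):.3f}")
```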
Taken together, these results show that, under appropriate conditions on the baseline convergence rate, the fraction of real data, and the bias decay, the generators produced by these recursive procedures can still converge to the true distribution within our framework, even when trained on a mixture of synthetic and potentially biased real data. We support these theoretical results with empirical studies. All proofs and additional experimental details are provided in the Appendix.
In this section, we review recursive training paradigms that have been studied in the existing literature. Thus far, much of this work has aimed to characterize the conditions for model collapse under different data-contamination schemes and recursive training settings. For instance, Shumailov et al. (2024) study iterative training in which each generation is trained exclusively on synthetic samples produced by the previous generator. Under discrete and Gaussian models trained by maximum likelihood, they prove that such synthetic-only recursive training inevitably leads to model collapse, and their experiments confirm severe degradation.
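A minimal sketch of this synthetic-only setting (our own illustration of the Gaussian case, not the authors' code) makes the collapse mechanism visible: refitting a Gaussian by maximum likelihood to a finite sample drawn solely from the previous generation drives the fitted variance toward zero over successive generations.

```python
# Minimal sketch (our own illustration, not the authors' code): synthetic-only
# recursive training of a Gaussian by maximum likelihood. Each generation is fit
# to a finite sample from the previous generation, and the fitted standard
# deviation drifts toward zero, i.e. model collapse.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.0, 1.0        # generation 0: the "true" model
n, T = 100, 2_000           # sample size per generation, number of generations

for t in range(T):
    sample = rng.normal(mu, sigma, n)         # draw only from the current model
    mu, sigma = sample.mean(), sample.std()   # MLE refit becomes the next generation

print(f"after {T} generations: mean = {mu:.3f}, std = {sigma:.2e}")  # std collapses toward 0
```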
Building on this line of work, Suresh et al. (2024) provide explicit collapse-rate guarantees for maximum-likelihood training under discrete and Gaussian models, giving quantitative bounds on how quickly recursive training diverges from the target distribution. Again in this setting, the generator at each iteration is trained exclusively on the synthetic output of the previous iteration's generator, with no data accumulation across iterations and no re-introduction of real data.
Other works have moved beyond the synthetic-only case and studied the role of re-introducing real data into recursive training frameworks. Shumailov et al. (2023) analyze a closely related synthetic-only setting and similarly show that recursive training on model-generated samples leads to collapse. However, they also present an empirical study of what the authors denote a partial-refresh regime, in which a 10% subsample of the original real data is mixed in at each iteration.
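Adapting the synthetic-only sketch above to this partial-refresh regime (again our own toy illustration, not the original study's code), each generation is refit on a 10% subsample of the original real data plus synthetic samples from the previous generation; the reinjected real data keeps the fitted model from collapsing.

```python
# Toy adaptation of the previous sketch to the partial-refresh regime described
# above (our own illustration, not the original experiments): each generation is
# refit on a 10% subsample of the original real data plus synthetic samples.
import numpy as np

rng = np.random.default_rng(2)
n, real_frac, T = 1_000, 0.10, 2_000
real_pool = rng.normal(0.0, 1.0, n)             # original real dataset, drawn once
mu, sigma = real_pool.mean(), real_pool.std()   # generation 0: fit to the real data

for t in range(T):
    real_part = rng.choice(real_pool, int(real_frac * n), replace=False)  # 10% real subsample
    synth = rng.normal(mu, sigma, n - int(real_frac * n))                 # 90% synthetic samples
    mixed = np.concatenate([real_part, synth])
    mu, sigma = mixed.mean(), mixed.std()                                 # MLE refit on the mixture

print(f"after {T} generations with 10% real data: mean = {mu:.3f}, std = {sigma:.3f}")
```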