Generalization of Diffusion Models Arises with a Balanced Representation Space
Diffusion models excel at generating high-quality, diverse samples, yet they risk memorizing training data when they overfit the training objective. We analyze the distinctions between memorization and generalization in diffusion models through the lens of representation learning. By investigating a two-layer ReLU denoising autoencoder (DAE), we prove that (i) memorization corresponds to the model storing raw training samples in the learned weights for encoding and decoding, yielding localized, spiky representations, whereas (ii) generalization arises when the model captures local data statistics, producing balanced representations. Furthermore, we validate these theoretical findings on real-world unconditional and text-to-image diffusion models, demonstrating that the same representation structures emerge in deep generative models, with significant practical implications. Building on these insights, we propose a representation-based method for detecting memorization and a training-free editing technique that allows precise control via representation steering. Together, our results highlight that learning good representations is central to novel and meaningful generative modeling.
💡 Research Summary
This paper investigates why diffusion models sometimes merely memorize training data and other times generate novel samples, by focusing on the structure of the intermediate representation space. The authors study a minimal two‑layer ReLU denoising autoencoder (DAE) that mirrors the denoising step used in diffusion models. Under a mixture‑of‑Gaussians data assumption, they introduce the notion of (α, β)-separability, which captures tight within‑cluster concentration and clear inter‑cluster margins.
The central theoretical contribution (Theorem 3.1) shows that any local minimizer of the regularized empirical loss can be expressed as a block‑wise composition of cluster‑specific weight matrices plus a small residual that decays exponentially with the noise level and the cluster margin. Building on this structure, three learning regimes are identified:
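In symbols, the decomposition described above might be sketched as follows. This is an illustrative form only; the symbols K (number of clusters), W_k (cluster-specific block), E (residual), β (inter-cluster margin), and σ (noise level) follow the summary's description, and the exact constants and exponent are those of the paper's Theorem 3.1, not this sketch:

```latex
% Illustrative sketch, not the paper's exact statement:
% a local minimizer decomposes into cluster-specific blocks plus
% a residual that is exponentially small in the margin-to-noise ratio.
W^{\star} \;=\; \sum_{k=1}^{K} W_k \;+\; E,
\qquad
\lVert E \rVert \;\le\; C\, e^{-c\,\beta/\sigma}.
```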
- **Memorization regime** – When the model is heavily over‑parameterized relative to the amount of data (especially for sparse clusters), the optimal weights store each training sample directly. The encoder’s hidden representation becomes “spiky”: only a few neurons fire strongly, and the decoder reconstructs the exact training example.
- **Generalization regime** – When the model is under‑parameterized but each cluster contains many examples, the weights instead learn local data statistics (means and covariances). The hidden representation is “balanced”: activations are spread across many neurons, and the decoder produces new samples that follow the underlying distribution. A Jacobian analysis confirms that the mapping reflects local statistics rather than any single data point.
- **Hybrid regime** – Real‑world datasets are imbalanced: abundant clusters fall into the generalization regime while rare clusters fall into the memorization regime. Consequently, the same model can exhibit both spiky and balanced representations depending on the input’s cluster.
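The object under study is small enough to write down directly. The sketch below shows the two-layer ReLU DAE architecture the theory analyzes, with random weights for illustration only (the regimes above concern *trained* local minimizers; the dimensions and initialization scales here are arbitrary choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Minimal two-layer ReLU DAE: x_hat = W2 @ relu(W1 @ x_noisy).
# Illustrative random weights; the theory concerns trained minimizers.
d, m = 16, 64                                       # data dim, hidden width
W1 = rng.normal(scale=1 / np.sqrt(d), size=(m, d))  # encoder weights
W2 = rng.normal(scale=1 / np.sqrt(m), size=(d, m))  # decoder weights

def denoise(x_noisy):
    h = relu(W1 @ x_noisy)   # hidden representation (spiky vs. balanced)
    return W2 @ h, h

x = rng.normal(size=d)
x_noisy = x + 0.1 * rng.normal(size=d)  # additive Gaussian noise
x_hat, h = denoise(x_noisy)
```

Whether `h` concentrates on a few coordinates or spreads across many is exactly the spiky-versus-balanced distinction drawn above.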
Empirical validation is performed on three fronts. First, synthetic experiments with the two‑layer ReLU DAE reproduce the predicted spiky versus balanced activation patterns across different noise levels and regularization strengths. Second, the authors extract intermediate bottleneck activations from large‑scale diffusion models (Stable Diffusion v1.4, DiT, EDM) and observe the same dichotomy: memorized images yield low‑rank, highly concentrated activations, whereas novel generated images show high‑rank, distributed activations. Third, they derive two practical tools from these insights.
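The low-rank versus high-rank dichotomy can be quantified with an entropy-based effective rank of an activation matrix. This is one plausible metric for the behavior described, not necessarily the paper's exact statistic:

```python
import numpy as np

def effective_rank(A, eps=1e-12):
    """Entropy-based effective rank of an activation matrix A.

    Low values suggest concentrated, memorization-like representations;
    high values suggest distributed, generalization-like ones.
    Illustrative metric, not necessarily the paper's exact statistic.
    """
    s = np.linalg.svd(A, compute_uv=False)       # singular values
    p = s / (s.sum() + eps)                      # normalized spectrum
    p = p[p > eps]                               # drop numerical zeros
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))                # exp(spectral entropy)

rng = np.random.default_rng(1)
low_rank = rng.normal(size=(32, 1)) @ rng.normal(size=(1, 64))  # rank-1
distributed = rng.normal(size=(32, 64))                         # full spread

print(effective_rank(low_rank))     # close to 1 (concentrated)
print(effective_rank(distributed))  # much larger (distributed)
```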
Memorization detection leverages a spikiness metric (e.g., the ratio of the maximum activation to the overall L2 norm) to flag samples that are likely memorized, achieving higher precision and recall than prior detection methods without requiring any prompts.
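The ratio mentioned above is straightforward to compute. The sketch below implements one plausible instance of it; the detection threshold is an assumed placeholder, and the paper's exact metric and cutoff may differ:

```python
import numpy as np

def spikiness(h, eps=1e-12):
    """Spikiness score: max |activation| over L2 norm.

    Near 1            -> a single neuron dominates (memorization-like).
    Near 1/sqrt(len(h)) -> activations spread evenly (generalization-like).
    One plausible instance of the ratio described; the paper's exact
    metric may differ.
    """
    h = np.abs(np.asarray(h, dtype=float))
    return float(h.max() / (np.linalg.norm(h) + eps))

spiky = np.zeros(256)
spiky[7] = 5.0                 # one dominant neuron
balanced = np.ones(256)        # evenly spread activations

print(spikiness(spiky))        # ~1.0
print(spikiness(balanced))     # ~1/sqrt(256) = 0.0625

THRESHOLD = 0.5                # assumed cutoff, for illustration only
is_likely_memorized = spikiness(spiky) > THRESHOLD
```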
Representation‑space steering adds a linear offset in the hidden space to impose a desired style or attribute. Balanced representations respond smoothly, enabling controllable editing, while spiky representations remain largely unchanged, demonstrating that memorized samples are hard to steer.
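The steering operation itself is a one-liner. The sketch below shows the linear-offset edit described above; the decoder weights and steering direction are random placeholders (in practice the direction would come from the representation space, e.g., a difference of mean activations for two attributes):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 16, 64
W2 = rng.normal(scale=1 / np.sqrt(m), size=(d, m))  # illustrative decoder

def steer(h, direction, strength=1.0):
    """Training-free edit: add a linear offset in the hidden space."""
    return h + strength * direction

h_balanced = np.abs(rng.normal(size=m))       # distributed activations
direction = rng.normal(size=m)                # placeholder attribute direction
direction /= np.linalg.norm(direction)

out_orig = W2 @ h_balanced
out_edit = W2 @ steer(h_balanced, direction, strength=2.0)
shift = np.linalg.norm(out_edit - out_orig)   # nonzero: the edit takes effect
```

Because the decoder is linear in `h`, the output moves by exactly `W2 @ (strength * direction)`; the claim in the text is that for spiky (memorized) representations this offset fails to produce a meaningful semantic change.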
The paper concludes that the geometry of the representation space is the key determinant of whether a diffusion model memorizes or generalizes. This perspective unifies theoretical analysis, empirical observation, and practical applications such as privacy‑preserving auditing and user‑friendly image editing. Limitations include the focus on a fixed noise level and a simple ReLU architecture; extending the theory to multi‑step schedules, deeper networks, and other activations remains an open direction. Overall, the work offers a compelling, representation‑centric framework that bridges the gap between diffusion model theory and real‑world deployment.