Reading time: 15 minutes

📝 Original Info

  • Title:
  • ArXiv ID: 2512.20905
  • Date:
  • Authors: Unknown

📝 Abstract

Deep clustering methods typically rely on a single, well-defined representation for clustering. In contrast, pretrained diffusion models provide abundant and diverse multi-scale representations across network layers and noise timesteps. However, a key challenge is how to efficiently identify the most clustering-friendly representation in the layer×timestep space. To address this issue, we propose Diffusion Embedded Clustering (DiEC), an unsupervised framework that performs clustering by leveraging optimal intermediate representations from pretrained diffusion models. DiEC systematically evaluates the clusterability of representations along the trajectory of network depth and noise timesteps. Meanwhile, an unsupervised search strategy is designed for recognizing the Clustering-optimal Layer (COL) and Clustering-optimal Timestep (COT) in the layer×timestep space of pretrained diffusion models, aiming to promote clustering performance and reduce computational overhead. DiEC is fine-tuned primarily with a structure-preserving DEC-style KL-divergence objective at the fixed COL + COT, together with a random-timestep diffusion denoising objective to maintain the generative capability of the pretrained model. Without relying on augmentation-based consistency constraints or contrastive learning, DiEC achieves excellent clustering performance across multiple benchmark datasets.

📄 Full Content

Clustering aims to uncover the intrinsic structure of unlabeled data. However, modern datasets are often high-dimensional, strongly nonlinear, and heavily contaminated by noise, making classical clustering methods that rely on shallow features and fixed distance metrics unreliable in practice [Liu et al., 2016;Liu et al., 2017]. By coupling neural representation learning with clustering objectives, deep clustering learns more semantic and noise-robust embeddings, demonstrating good performance in applications such as image understanding, anomaly detection, and large-scale retrieval [Ren et al., 2024].

Based on diverse network architectures, such as autoencoders (AE) [Xie et al., 2016; Guo et al., 2017a], variational autoencoders (VAE) [Jiang et al., 2016; Yang et al., 2019b], and generative adversarial networks (GAN) [Mukherjee et al., 2019], embedding-based deep clustering methods typically learn representations by jointly optimizing reconstruction and clustering objectives. Nevertheless, these methods ultimately rely on a single, fixed latent representation for clustering, which lacks dynamic adaptability and struggles to align with the structural variations of data across different semantic scales.

Diffusion models, a significant breakthrough in generative artificial intelligence, excel at precisely modeling complex data distributions and capturing multi-scale semantic representations. Through forward noising and reverse denoising, they construct hierarchical features across different timesteps. A recent study [Wang et al., 2024] further suggests that denoising representations can be spontaneously organized into multiple low-dimensional subspaces, naturally aligning with the needs of clustering. Yet, existing diffusion-based clustering research [Yan et al., 2025; Uziel et al., 2025; Yang et al., 2024] has yet to fully exploit this potential; in particular, it lacks systematic evaluation and effective selection of clustering-friendly representations across network layers and timesteps. Moreover, the high computational and time cost of diffusion inference remains a critical practical constraint.

Motivated by these insights, we propose Diffusion Embedded Clustering (DiEC), an unsupervised framework that leverages internal representations of a pretrained diffusion U-Net for clustering. DiEC views the layer×timestep space as a representation trajectory and performs unsupervised evaluation to identify clustering-optimal readouts, namely the Clustering-Optimal Layer (COL) and the Clustering-Optimal Timestep (COT). To alleviate interference between denoising and clustering objectives, DiEC introduces residual feature decoupling for effective task adaptation. For optimization, DiEC couples a DEC-style KL self-training objective with graph regularization to strengthen structural consistency, while retaining a random-timestep denoising reconstruction objective to preserve generative capability. Extensive experiments and ablation studies demonstrate that DiEC achieves excellent clustering performance on multiple benchmarks without relying on augmentation-based consistency constraints or contrastive learning. Our contributions are:

• DiEC provides a novel insight into diffusion-based clustering, revealing that effective feature selection within the multi-scale semantic representations of diffusion models is crucial for improving clustering performance.

• DiEC further develops an efficient, cost-aware, and label-free search strategy for recognizing the clustering-friendly COL + COT in the layer×timestep space of pretrained diffusion models, aiming to promote clustering performance and reduce computational overhead.

• DiEC is fine-tuned primarily with a structure-preserving DEC-style KL-divergence clustering objective at the fixed COL + COT, together with a random-timestep diffusion denoising objective to maintain the generative capability of pretrained diffusion models.

… et al., 2025; Qiu et al., 2024; Zhu et al., 2025]. Third, diffusion-driven clustering has also been explored in practical application scenarios, such as hyperspectral imaging and ultrasound enhancement [Chen et al., 2023; Chen et al., 2025]. In contrast to prior diffusion clustering methods, we focus on systematically identifying clustering-friendly readouts from the intrinsic multi-scale representations of pretrained diffusion models, aiming to improve clustering quality while reducing computational overhead.

3 Background

Diffusion models. The core idea of diffusion models is to formulate generation as iterative denoising: a forward process gradually corrupts data with Gaussian noise, and a learned reverse process progressively removes noise to generate samples from the data distribution.

In DDPM [Ho et al., 2020], the forward noising process is defined as a Markov chain that gradually corrupts a data sample x_0 ∼ q_data(x):

q(x_{1:T} | x_0) = ∏_{t=1}^T q(x_t | x_{t-1}),   q(x_t | x_{t-1}) = N(x_t; √(1 − β_t) x_{t-1}, β_t I),

where {β_t}_{t=1}^T is a predefined noise schedule and t ∈ {1, . . . , T} denotes the diffusion timestep.

Let α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^t α_s. Then, the marginal distribution of x_t admits a closed-form expression

q(x_t | x_0) = N(x_t; √ᾱ_t x_0, (1 − ᾱ_t) I),

which shows that the noise level of x_t is fully governed by the timestep t. The reverse denoising process is parameterized as a Gaussian transition

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_t² I),

where the mean is expressed via a noise-prediction network ε_θ as

μ_θ(x_t, t) = (1 / √α_t) ( x_t − (β_t / √(1 − ᾱ_t)) ε_θ(x_t, t) ),

and the variance is fixed to the posterior value as

σ_t² = ((1 − ᾱ_{t-1}) / (1 − ᾱ_t)) β_t.

In practice, the noise-prediction network ε_θ(x_t, t) predicts the forward-process noise ε and is trained with a simple mean-squared error objective:

L_simple(θ) = E_{x_0, ε, t} [ ‖ ε − ε_θ(√ᾱ_t x_0 + √(1 − ᾱ_t) ε, t) ‖² ].

The above MSE objective can be interpreted as a simplified variational bound on the data likelihood. At the sampling stage, starting from x_T ∼ N(0, I), the reverse transition p_θ(x_{t-1} | x_t) is applied to progressively denoise samples and generate data.
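As a concrete reference, the training objective above can be sketched in a few lines of PyTorch; `eps_model` is a placeholder for any noise-prediction network with the signature ε_θ(x_t, t), and the linear β schedule with T = 1000 is only an illustrative choice rather than the paper's setting.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the simple DDPM objective described above.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # predefined noise schedule {beta_t}
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # bar(alpha)_t = prod_{s<=t} alpha_s

def ddpm_loss(eps_model, x0):
    """L_simple = E[ || eps - eps_theta(sqrt(abar_t) x0 + sqrt(1 - abar_t) eps, t) ||^2 ]."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)       # uniform random timestep
    eps = torch.randn_like(x0)                            # forward-process noise
    abar = alphas_bar.to(x0.device)[t].view(b, 1, 1, 1)   # assumes NCHW image batches
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps    # closed-form forward noising
    return F.mse_loss(eps_model(x_t, t), eps)
```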

DDPMs use discrete timesteps, while later work introduced continuous-time variants and an SDE-based view [Song et al., 2020]. Recent studies have also explored Vision Transformers (ViTs) as diffusion backbones [Dosovitskiy, 2020; Peebles and Xie, 2023]. In this paper, we follow the discrete-time DDPM formulation with a U-Net denoiser to match our timestep-selection analysis.

Deep Embedded Clustering. Deep Embedded Clustering (DEC) is a method that simultaneously learns feature representations and cluster assignments [Xie et al., 2016]. Given a set of embeddings {z_i}_{i=1}^N and cluster centroids {μ_k}_{k=1}^K, DEC defines the soft assignment via a Student's t-distribution kernel as

q_ik = (1 + ‖z_i − μ_k‖² / ν)^{−(ν+1)/2} / Σ_{k'} (1 + ‖z_i − μ_{k'}‖² / ν)^{−(ν+1)/2},   (7)

where ν is the degree of freedom of the Student's t-distribution (set to 1 in practice). To emphasize high-confidence assignments and mitigate cluster-size imbalance, DEC constructs a target distribution

p_ik = (q_ik² / f_k) / Σ_{k'} (q_ik'² / f_{k'}),   with f_k = Σ_i q_ik.   (8)

DEC then minimizes the KL divergence

L_KL = KL(P ‖ Q) = Σ_i Σ_k p_ik log (p_ik / q_ik)   (9)

for self-training, progressively sharpening the soft assignments and improving clustering consistency.
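For reference, the DEC quantities above can be sketched compactly in PyTorch; the ν = 1 kernel follows the original DEC paper, and the small constant inside the logarithm is only for numerical stability.

```python
import torch

# Minimal sketch of Eqs. (7)-(9).
def soft_assign(z, mu, nu=1.0):
    """Eq. (7): q_ik proportional to (1 + ||z_i - mu_k||^2 / nu)^(-(nu + 1) / 2)."""
    d2 = torch.cdist(z, mu).pow(2)
    q = (1.0 + d2 / nu).pow(-(nu + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Eq. (8): p_ik proportional to q_ik^2 / f_k, where f_k = sum_i q_ik."""
    w = q.pow(2) / q.sum(dim=0, keepdim=True)
    return w / w.sum(dim=1, keepdim=True)

def dec_kl(q):
    """Eq. (9): KL(P || Q) with P treated as a fixed target (averaged over the batch here)."""
    p = target_distribution(q).detach()
    return (p * (torch.log(p + 1e-12) - torch.log(q + 1e-12))).sum(dim=1).mean()

# Example: 256 embeddings of dimension 32 and K = 10 centroids.
z, mu = torch.randn(256, 32), torch.randn(10, 32)
loss = dec_kl(soft_assign(z, mu))
```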

To leverage the multi-scale representations of diffusion models for clustering, DiEC first identifies the COL + COT by searching the layer×timestep space of pretrained diffusion models, which guide the extraction, as illustrated in Fig. 1. Then, following Fig. 2, DiEC learns the embedding representations at the selected COL + COT through a shared pretrained U-Net backbone, and decouples the clustering representations via a lightweight residual network. Finally, DiEC performs clustering based on a DEC-style KL-divergence objective, supplemented by a random-timestep diffusion denoising objective to maintain the generative capability of pretrained diffusion models.

Unsupervised Clusterability Metric. In U-Net-based diffusion models, multi-scale semantic representations are indexed by both the network layer l and the noise timestep t.

Given an unlabeled dataset D = {x_{0,i}}_{i=1}^N and the number of clusters K, let x_{t,i} denote the noisy observation of the clean sample x_{0,i} at diffusion timestep t. We then obtain the embedding representations E_{l,t} = {e_i^l(t)}_{i=1}^N at U-Net layer l and timestep t as

e_i^l(t) = h_θ^l(x_{t,i}, t),

where h_θ^l(·, ·) denotes the feature readout at layer l of the U-Net denoiser. The clusterability of these representations is evaluated with the Scott Score, denoted SS(l, t), an unsupervised metric for assessing clustering quality without ground-truth labels.

We run k-means on E_{l,t} = {e_i^l(t)}_{i=1}^N to obtain the cluster assignments C_k^l(t) and centroids μ_k^l(t), and define the within-cluster scatter and the between-cluster scatter as

W(l, t) = Σ_{k=1}^K Σ_{e_i^l(t) ∈ C_k^l(t)} ‖ e_i^l(t) − μ_k^l(t) ‖²,
B(l, t) = Σ_{k=1}^K |C_k^l(t)| ‖ μ_k^l(t) − ē^l(t) ‖²,

where ē^l(t) = (1/N) Σ_{i=1}^N e_i^l(t) denotes the global mean embedding across all samples.

(Figure: left, the smoothed Scott Score SS_Sm(l, t); right, the ground-truth smoothed accuracy ACC_Sm. The consistent alignment between the two metrics validates the COL selected by SS_Sm; on this dataset, it is located at the bottleneck layer.)

Then, the Scott Score SS(l, t) is computed from these within- and between-cluster scatter terms, with larger values indicating a more clusterable representation.

To reduce noise, we smooth the Scott Score along timesteps with a centered moving average,

SS_Sm(l, t) = (1 / |N(t)|) Σ_{τ ∈ N(t)} SS(l, τ),

where N(t) = {τ ∈ T : |τ − t| ≤ ⌊w/2⌋}, T is the set of evaluated timesteps, and w is the window size. In DiEC, this smoothed Scott Score SS_Sm(l, t) serves as the unsupervised internal criterion for searching clustering-friendly representations.

Optimal Search for COL and COT. Based on the smoothed Scott Score, we design an Optimal Search strategy for identifying the COL and the COT, denoted l* and t* respectively. This strategy consists of two sequential stages: layer selection, followed by timestep selection.

In layer selection, considering the dimensional differences between representations across layers of the U-Net, we apply a PCA-based layer-wise alignment map A_l(·) to obtain aligned embeddings Ẽ_{l,t} = {ẽ_i^l(t)}_{i=1}^N, where ẽ_i^l(t) = A_l(e_i^l(t)), and evaluate the corresponding smoothed Scott Score SS_Sm(l, t).

To obtain a robust evaluation for each layer, the timesteps ranked in the top ρ percent (e.g., ρ = 20) are used to identify the COL as

l* = argmax_l (1 / |T_l^ρ|) Σ_{t ∈ T_l^ρ} SS_Sm(l, t),

where T_l^ρ = {t ∈ T : rank_t(SS_Sm(l, t)) ≤ ⌈ρ|T|⌉}. From Fig. 3, we can see that the COL is located at the bottleneck layer on the USPS dataset (left panel). At the COL, the unsupervised Scott Score (SS) computed from the PCA-aligned inputs and the ground-truth clustering accuracy (ACC) exhibit similar trends. This observation confirms the effectiveness of the smoothed Scott Score as a reliable metric for identifying the COL.

In timestep selection, with l* fixed, we evaluate clusterability directly on the native representations to remain consistent with training. The COT is then identified by

t* = argmax_{t ∈ T} SS_Sm(l*, t).
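A compact sketch of this two-stage search is given below, assuming the per-(layer, timestep) embeddings have already been extracted into `feats[(l, t)]`. Since the paper's exact Scott Score formula is not reproduced here, `clusterability` uses the ratio of between- to within-cluster scatter as a stand-in with the same "higher is more clusterable" role; the PCA alignment in Stage 1, top-ρ averaging, and native-feature scoring in Stage 2 follow the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def clusterability(E, K):
    km = KMeans(n_clusters=K, n_init=10).fit(E)
    mu, lab = km.cluster_centers_, km.labels_
    W = sum(((E[lab == k] - mu[k]) ** 2).sum() for k in range(K))                   # within-cluster scatter
    B = sum((lab == k).sum() * ((mu[k] - E.mean(0)) ** 2).sum() for k in range(K))  # between-cluster scatter
    return B / (W + 1e-12)

def smooth(scores, w=5):
    """Centered moving average along the evaluated timesteps (edge effects ignored)."""
    return np.convolve(scores, np.ones(w) / w, mode="same")

def search_col_cot(feats, K, rho=0.2, d=64):
    layers = sorted({l for l, _ in feats})
    timesteps = sorted({t for _, t in feats})
    # Stage 1: PCA-align each layer to at most d dimensions, smooth the score curve,
    # and rank layers by the mean of their top-rho fraction of timestep scores.
    layer_score = {}
    for l in layers:
        curve = []
        for t in timesteps:
            E = feats[(l, t)]
            E = PCA(n_components=min(d, E.shape[1])).fit_transform(E)
            curve.append(clusterability(E, K))
        s = smooth(np.asarray(curve))
        k_top = max(1, int(np.ceil(rho * len(timesteps))))
        layer_score[l] = float(np.sort(s)[-k_top:].mean())
    l_star = max(layer_score, key=layer_score.get)
    # Stage 2: with l* fixed, score the native (non-aligned) features and pick the peak.
    s = smooth(np.asarray([clusterability(feats[(l_star, t)], K) for t in timesteps]))
    t_star = timesteps[int(np.argmax(s))]
    return l_star, t_star
```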

Fig. 4 shows the trajectory alignment between the unsupervised Scott Score (SS) and ground-truth clustering accuracy (ACC) across diffusion timesteps on the MNIST dataset. We also plot their smoothed values using the centered moving average mentioned above. It can be clearly seen that SS exhibits the same trend as ACC. In particular, once the effect of noise is removed, their smoothed values are almost perfectly matched, reaching their maximum at the same timestep. This demonstrates that the smoothed Scott Score is a suitable evaluation metric for locating the COT.

Algorithm 1 (DiEC joint training, excerpt):
  // Branch A: Clustering at the fixed COL + COT
  Sample noise ε ∼ N(0, I) and generate x_{t*,i}.
  Extract features e_i = h_θ^{l*}(x_{t*,i}, t*).
  Compute the residual embedding z_i (Eq. 18).
  Compute soft assignments Q (Eq. 7) and the clustering loss L_KL (Eq. 9).
  Compute the graph constraint losses L_Gr, L_En (Eqs. 21, 22).
  // Branch B: Diffusion Consistency
  Sample random timesteps t_r ∼ U({1, . . . , T}).
  Generate noisy inputs x_{t_r,i}.
  Compute the denoising loss L_Re (Eq. 23).
  // Joint Update
  Update S via Lagrange multipliers.
  L_total = L_Re + α L_KL + β L_Gr + γ L_En.
  Update θ, φ, {μ_k} via gradient descent on ∇L_total.
  end for
end for
Output: cluster centroids {μ_k}, assignments Q.

The complete procedure is detailed in Section A of the Appendix. Specifically, to mitigate diffusion stochasticity and reduce computational overhead, we estimate scores on a sample subset and average them over several independent noise realizations. This strategy substantially accelerates inference while maintaining high clustering accuracy.
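A sketch of this cost-reduction strategy, under the assumption that `extract_fn` produces features for a fresh forward-noise draw and `score_fn` is whatever unsupervised score is used (e.g., the smoothed Scott Score); the subset size and trial count are illustrative.

```python
import numpy as np

def estimate_score(extract_fn, X, K, score_fn, subset=2000, trials=3, seed=0):
    """Estimate a clusterability score on a random subset, averaged over noise trials."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(subset, len(X)), replace=False)
    # Each call to extract_fn re-samples the forward-process noise for the subset.
    return float(np.mean([score_fn(extract_fn(X[idx]), K) for _ in range(trials)]))
```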

Given the features e_i^{l*}(t*) extracted at (l*, t*), we obtain clustering representations z_i through a lightweight residual mapping, and optimize a DEC-style KL self-training objective with graph regularization to strengthen cluster structure. Meanwhile, we add a standard denoising loss at a random timestep t_r, which helps maintain diffusion consistency and stabilize the representations.

Residual-decoupled embedding. To simplify the notation, we write e_i for e_i^{l*}(t*), where

e_i = h_θ^{l*}(x_{t*,i}, t*),

and h_θ^{l*}(·, ·) denotes the feature readout at layer l* of the pretrained U-Net denoiser.
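A sketch of how such a readout h_θ^{l*} could be implemented with a PyTorch forward hook; `TinyUNet` is only a stand-in so the snippet runs (its timestep input is ignored), and `mid_block` plays the role of the selected layer l*, e.g. the bottleneck block in the USPS example above.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Stand-in denoiser; in DiEC this would be the pretrained diffusion U-Net."""
    def __init__(self, ch=32):
        super().__init__()
        self.down = nn.Conv2d(1, ch, 3, stride=2, padding=1)
        self.mid_block = nn.Conv2d(ch, ch, 3, padding=1)          # plays the role of l*
        self.up = nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1)

    def forward(self, x, t):
        h = torch.relu(self.down(x))
        h = torch.relu(self.mid_block(h))
        return self.up(h)                                         # predicted noise

unet, feats = TinyUNet(), {}
unet.mid_block.register_forward_hook(
    lambda module, inputs, output: feats.update(e=output.flatten(start_dim=1).detach())
)

@torch.no_grad()
def extract_features(x0, t_star, alphas_bar):
    """Noise x0 to timestep t* and read out e_i = h_theta^{l*}(x_{t*,i}, t*) via the hook."""
    eps = torch.randn_like(x0)
    a = alphas_bar[t_star]
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps                  # closed-form forward noising
    t = torch.full((x0.shape[0],), t_star, dtype=torch.long)
    unet(x_t, t)                                                  # forward pass fills feats["e"]
    return feats["e"]

# Example with DDPM-style cumulative alphas for T = 1000 steps.
alphas_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)
e = extract_features(torch.randn(8, 1, 28, 28), t_star=100, alphas_bar=alphas_bar)
```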

To reduce potential interference between the denoising objective and the clustering objective, we adopt a lightweight residual mapping g_φ(·) (a two-layer ReLU MLP with matched input-output dimensions) to decouple clustering adaptation from the pretrained representation. The final clustering embedding is defined as

z_i = e_i + g_φ(e_i).   (18)
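A minimal PyTorch sketch of this residual head; the hidden width equal to the feature dimension is an assumption, as the paper only fixes the matched input-output dimensionality.

```python
import torch
import torch.nn as nn

class ResidualHead(nn.Module):
    """Two-layer ReLU MLP g_phi added back onto the extracted feature (Eq. 18)."""
    def __init__(self, dim, hidden=None):
        super().__init__()
        hidden = hidden or dim
        self.g = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, e):
        return e + self.g(e)          # z_i = e_i + g_phi(e_i)

# Example: features of dimension 512 extracted at (l*, t*).
head = ResidualHead(dim=512)
z = head(torch.randn(16, 512))
```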

This residual branch partially decouples clustering from denoising and enhances the expressiveness of the clustering loss.

Clustering objective. Given the clustering embeddings {z_i}_{i=1}^N, we adopt a DEC-style KL-divergence self-training objective. Since diffusion noising introduces stochasticity that can affect centroid initialization, we compute the mean embedding by averaging over M independent noise realizations at the selected layer l* and timestep t* as

ē_i = (1/M) Σ_{m=1}^M h_θ^{l*}(x_{t*,i}^{(m)}, t*),

where x_{t*,i}^{(m)} denotes the m-th noisy realization of x_{0,i} at timestep t*.

Then, k-means is applied to {z_i}_{i=1}^N to initialize the centroids {μ_k}_{k=1}^K. Based on {z_i}_{i=1}^N and {μ_k}_{k=1}^K, we compute the soft assignments Q via Eq. (7) and the target distribution P via Eq. (8). The clustering objective is then defined as their KL divergence in Eq. (9).

Adaptive graph regularization. To preserve local consistency in the embedding space, we introduce an adaptive graph regularization term. Specifically, we construct a row-normalized affinity matrix S = [s_ij] based on the k-NN neighborhood NB_i of each sample, and impose a graph constraint L_Gr on the soft assignments Q to strengthen the cluster structure (Eq. 21).

In addition, we introduce an entropy regularizer L_En (Eq. 22) to prevent trivial solutions.
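The closed forms of Eqs. (21)-(22) are not reproduced in this extraction, so the sketch below shows only one common instantiation rather than the paper's exact losses: a fixed row-normalized k-NN affinity (the paper's S is adaptive and updated via Lagrange multipliers), a smoothness penalty that pulls neighbors' soft assignments together, and an entropy term on the average assignment that discourages collapsing all samples into a single cluster.

```python
import torch

def knn_affinity(z, k=10):
    d = torch.cdist(z, z)
    d.fill_diagonal_(float("inf"))
    idx = d.topk(k, largest=False).indices                 # k nearest neighbors NB_i
    s = torch.zeros(z.shape[0], z.shape[0], device=z.device)
    s.scatter_(1, idx, 1.0 / k)                            # row-normalized affinity S
    return s

def graph_loss(q, s):
    """Graph smoothness on the soft assignments: sum_j s_ij ||q_i - q_j||^2, batch-averaged."""
    return (s * torch.cdist(q, q).pow(2)).sum(dim=1).mean()

def entropy_loss(q):
    """Negative entropy of the mean assignment; minimizing it keeps clusters populated."""
    f = q.mean(dim=0).clamp_min(1e-12)
    return (f * f.log()).sum()
```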

Joint training with diffusion denoising. Updating the shared U-Net only at timestep t* may weaken its denoising performance at other timesteps, thereby impairing the overall stability of the diffusion representations. We therefore introduce an additional noise-reconstruction branch at random timesteps t_r to maintain generative capability. Specifically, in each iteration we sample t_r ∼ U({1, . . . , T}), apply the standard forward noising to obtain x_{t_r,i}, and use the noise-prediction MSE objective

L_Re = E_{i, ε, t_r} [ ‖ ε − ε_θ(x_{t_r,i}, t_r) ‖² ],   (23)

where ε ∼ N(0, I) is the injected forward-process noise.

As shown in Fig. 5, L_Re prevents drift during optimization and helps preserve generative capability across timesteps. Furthermore, Appendix Section B provides qualitative results that confirm the maintained generative quality.

We optimize the diffusion U-Net and the residual MLP module end-to-end with a weighted combination of objectives. The overall training objective is

L_total = L_Re + α L_KL + β L_Gr + γ L_En,

where L_Re is the diffusion-consistency denoising reconstruction loss at random timesteps, L_KL is the DEC-style clustering loss, and L_Gr and L_En are the adaptive graph and entropy regularization terms. α, β, and γ are weighting hyperparameters that balance the relative contributions of the loss terms. The complete training procedure is provided in Algorithm 1.
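Putting the pieces together, a schematic training step might look like the following, reusing the helper sketches from earlier sections (`ddpm_loss`, `soft_assign`, `dec_kl`, `knn_affinity`, `graph_loss`, `entropy_loss`, `ResidualHead`). The loss weights are arbitrary placeholders, and Algorithm 1 remains the authoritative procedure, including the Lagrange-multiplier update of S that this sketch omits.

```python
import torch

def training_step(extract_features_fn, head, centroids, x0, eps_model,
                  alpha=0.1, beta=0.01, gamma=0.01):
    e = extract_features_fn(x0)                 # features e_i at the fixed (l*, t*)
    z = head(e)                                 # residual-decoupled embedding z_i
    q = soft_assign(z, centroids)
    l_kl = dec_kl(q)                            # DEC-style self-training loss
    s = knn_affinity(z.detach())                # fixed affinity stand-in for the adaptive S
    l_gr, l_en = graph_loss(q, s), entropy_loss(q)
    l_re = ddpm_loss(eps_model, x0)             # random-timestep denoising consistency
    return l_re + alpha * l_kl + beta * l_gr + gamma * l_en
```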

Datasets. We evaluate DiEC on four widely used benchmarks: MNIST, USPS, Fashion-MNIST, and CIFAR-10. MNIST and Fashion-MNIST each comprise 70,000 grayscale images at 28 × 28 resolution. USPS contains 9,298 grayscale images at 16 × 16 resolution, and CIFAR-10 consists of 60,000 color images at 32 × 32 resolution. The dataset statistics are summarized in Table 1.

Evaluation metrics. We evaluate clustering performance using clustering accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI). Higher values indicate better clustering quality.

Table 3 reports the contribution of each component on the USPS dataset. First, compared with random selection of a layer and timestep (ID 1), the identified COL (ID 2) provides a 4.6% improvement in ACC. Next, by fixing the timestep to the COT, ID 3 brings a large improvement in clustering performance, resulting in an ACC of 96.83%. This confirms the importance of layer and timestep selection and the effectiveness of the smoothed Scott Score as a reliable evaluation metric for locating the COL and COT. Then, the subsequent integration of KL self-training, residual decoupling, and adaptive graph regularization (ID 4-6) further refines the feature space, thereby enhancing the clustering performance. Finally, the employment of random-timestep diffusion consistency (ID 7) further refines the representation, culminating in 98.49% ACC, 96.95% NMI, and 95.65% ARI.
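For completeness, the three metrics above can be computed with standard tools; clustering accuracy (ACC) maps predicted clusters to ground-truth labels with the Hungarian algorithm, while NMI and ARI come directly from scikit-learn. Inputs are assumed to be integer label arrays of equal length.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """ACC under the optimal cluster-to-label permutation (Hungarian matching)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                                   # co-occurrence counts
    row, col = linear_sum_assignment(-cost)               # maximize matched counts
    return cost[row, col].sum() / y_true.size

# nmi = normalized_mutual_info_score(y_true, y_pred)
# ari = adjusted_rand_score(y_true, y_pred)
```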


Optimal Search adopts a two-stage strategy. In both stages, we evaluate candidate representations on a sampled subset and reduce diffusion stochasticity by averaging over multiple noise trials, as shown in Algorithm A1.

In Stage 1, we sample a subset D s and evaluate each layer over a timestep set T . We then smooth the scores across timesteps and compute a layer-level score by averaging the top-ρ fraction of each layer’s timestep-level scores, which is used to select the COL.

In Stage 2, we fix the selected COL and evaluate the same timestep set T to identify the COT, again using smoothed scores for robust selection, and we stop early if the score does not improve for P consecutive timesteps.
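A small sketch of this early-stopping rule, assuming `timesteps` and `scores` are ordered by timestep and `scores` holds the smoothed Scott Scores at the fixed COL; the patience value P is an assumption.

```python
def select_cot(timesteps, scores, patience=5):
    """Scan timesteps and stop once the smoothed score stalls for `patience` evaluations."""
    best, t_star, stale = float("-inf"), None, 0
    for t, s in zip(timesteps, scores):
        if s > best:
            best, t_star, stale = s, t, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return t_star
```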

We evaluated the generative capability of the proposed DiEC on CIFAR-10. Fig. A1 presents images generated by the pretrained and fine-tuned models from the same set of samples. Both models exhibit high image fidelity and diversity. Notably, the images generated by the fine-tuned DiEC contain more details, which validates the effectiveness of our reconstruction objective L Re in maintaining generative consistency.


Algorithm A1, Stage 2 (excerpt):
  E_{l*,t} ← {ē_i(t)}_{x_{0,i} ∈ D_s}
  SS(l*, t) ← Scott(E_{l*,t})
  SS_Sm(l*, t) ← OnlineMovAvg(SS(l*, t), t, w)
  if SS_Sm(l*, t) > S_max then
      S_max ← SS_Sm(l*, t); t* ← t
  Output: (l*, t*)
