CleanDIFT: Diffusion Features without Noise

Notice: This research summary and analysis were generated automatically using AI. For complete accuracy, please refer to the original arXiv paper.

Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model to obtain the semantic features, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features that cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features readily outperform previous diffusion features by a wide margin in a wide variety of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.


💡 Research Summary

CleanDIFT addresses a fundamental limitation of diffusion‑based visual feature extraction: the need to add Gaussian noise to input images and to select a specific diffusion timestep t for each downstream task. Existing methods treat a diffusion model as a family of T feature extractors, each operating on a differently noised version of the image. While this yields task‑specific representations, it also discards a large portion of the original image information because the added noise itself dominates the feature variance, especially at the timesteps commonly used in practice (e.g., t = 261 in DIFT). The authors empirically demonstrate that a substantial fraction of the variance of diffusion features can be explained by the noise component alone, confirming that current pipelines are extracting noisy artefacts rather than pure image semantics.
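The noising step these pipelines rely on is the standard DDPM forward process. The sketch below is a minimal illustration (not the authors' code); the linear beta schedule, latent tensor shape, and T = 1000 are common Stable Diffusion defaults assumed here. Under this schedule, only about half of the variance of x_t at t = 261 comes from the image itself, which makes concrete why the added noise can dominate the extracted features.

```python
import torch

def noise_image(x0: torch.Tensor, t: int, T: int = 1000) -> torch.Tensor:
    """DDPM forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    betas = torch.linspace(1e-4, 0.02, T)        # assumed linear beta schedule
    abar = torch.cumprod(1.0 - betas, dim=0)     # cumulative product of (1 - beta)
    eps = torch.randn_like(x0)                   # the Gaussian noise that is added
    return abar[t].sqrt() * x0 + (1.0 - abar[t]).sqrt() * eps

x0 = torch.randn(1, 4, 64, 64)   # stand-in for an encoded (latent) image
xt = noise_image(x0, t=261)      # t = 261 is the timestep DIFT uses in practice
```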

To overcome this, CleanDIFT proposes a lightweight, unsupervised fine‑tuning procedure that converts a pre‑trained diffusion backbone (Stable Diffusion 1.5 or 2.1) into a noise‑free, timestep‑independent feature extractor. The method proceeds as follows: (1) a trainable copy of the diffusion U‑Net is created; this copy receives the clean image x₀ as input, while the frozen original diffusion model receives the same image corrupted with Gaussian noise ε at a sampled timestep t. (2) For each decoder layer k, a timestep‑conditioned projection head projₖ(·, t) is attached to the copy. During training, the projected features projₖ(featₖᶜ(x₀), t) are forced to align with the original model’s noisy features featₖ(x_t, t) by minimizing the negative cosine similarity across a stratified set of timesteps. (3) The loss aggregates this alignment over all selected layers, encouraging the copy to learn a single representation featᶜ(x₀) that simultaneously matches the entire spectrum of noisy features. (4) After training (≈400 optimization steps, ~30 minutes on a single A100), the projection heads are discarded; the internal activations of the copy are used directly as diffusion features.
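Steps (1)–(4) can be condensed into a toy training sketch. This is a schematic under stated assumptions, not the authors' implementation: the tiny convolutional backbone and single projection head stand in for Stable Diffusion's U-Net and its per-decoder-layer heads, and the student is conditioned on a fixed t = 0 as a simplification of timestep independence.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
abar = torch.cumprod(1.0 - betas, dim=0)        # for the forward noising process

class TinyBackbone(nn.Module):
    """Stand-in for the diffusion U-Net; returns a single feature map."""
    def __init__(self, dim=16):
        super().__init__()
        self.conv = nn.Conv2d(4, dim, 3, padding=1)
        self.t_emb = nn.Embedding(T, dim)
    def forward(self, x, t):
        return self.conv(x) + self.t_emb(t)[:, :, None, None]

class ProjHead(nn.Module):
    """Timestep-conditioned projection head proj_k(., t); used only in training."""
    def __init__(self, dim=16):
        super().__init__()
        self.t_emb = nn.Embedding(T, dim)
        self.out = nn.Conv2d(dim, dim, 1)
    def forward(self, f, t):
        return self.out(f + self.t_emb(t)[:, :, None, None])

teacher = TinyBackbone().eval()                 # frozen "pre-trained" model
for p in teacher.parameters():
    p.requires_grad_(False)
student = copy.deepcopy(teacher).train()        # (1) trainable copy
for p in student.parameters():
    p.requires_grad_(True)
head = ProjHead()                               # (2) projection head
opt = torch.optim.Adam([*student.parameters(), *head.parameters()], lr=1e-3)

def train_step(x0):
    t = torch.randint(0, T, (x0.shape[0],))
    a = abar[t].view(-1, 1, 1, 1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0)  # noised input
    with torch.no_grad():
        target = teacher(xt, t)                 # noisy teacher features
    f_clean = student(x0, torch.zeros_like(t))  # clean image, fixed t (simplified)
    # (3) negative cosine similarity between projected clean and noisy features
    loss = -F.cosine_similarity(head(f_clean, t).flatten(1),
                                target.flatten(1), dim=1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

for _ in range(10):
    loss = train_step(torch.randn(8, 4, 64, 64))
# (4) after training, `head` is discarded; student(x0, 0) yields the features
```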

This approach eliminates two costly design choices: (i) the explicit addition of noise, which reduces the information content of the input, and (ii) the need to tune a timestep per task, which previously required exhaustive search or ensembling across multiple timesteps. Consequently, inference becomes up to eight times faster than ensemble‑based baselines, with negligible extra memory overhead.
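The speedup follows directly from forward-pass counts, as the schematic below illustrates (toy network; the 8-member ensemble mirrors the paper's up-to-8x figure, everything else is illustrative):

```python
import torch
import torch.nn as nn

net = nn.Conv2d(4, 16, 3, padding=1)   # stand-in feature extractor
x0 = torch.randn(1, 4, 64, 64)         # clean (latent) image

# Ensemble baseline: one forward pass per noise level, then average.
feats = []
for k in range(8):
    xt = x0 + 0.1 * (k + 1) * torch.randn_like(x0)  # toy noising
    feats.append(net(xt))
ensemble_feat = torch.stack(feats).mean(dim=0)      # 8 forward passes

# Noise-free extractor: a single pass on the clean image.
clean_feat = net(x0)                                # 1 forward pass
```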

The authors evaluate CleanDIFT on four representative vision tasks: (a) zero‑shot unsupervised semantic correspondence, (b) monocular depth estimation, (c) semantic and panoptic segmentation, and (d) image classification. Across all benchmarks, CleanDIFT consistently outperforms standard diffusion features and even surpasses state‑of‑the‑art methods that rely on timestep ensembling. Notably, in semantic correspondence, CleanDIFT sets a new state of the art, improving average precision by 7–10 percentage points over DIFT. In depth estimation and segmentation, it reduces error metrics by roughly 10–12 % and raises mIoU by 2–3 %, respectively. For classification, a linear probe on the extracted features yields a 1.5 % gain in top‑1 accuracy. Moreover, when combined with DINOv2 or other self‑supervised representations, CleanDIFT’s features provide complementary information, leading to further modest improvements.

Ablation studies confirm that (i) the projection heads are essential during training but unnecessary at test time, (ii) aligning across a wide range of timesteps is crucial for achieving timestep independence, and (iii) the method’s performance plateaus after the brief 30‑minute fine‑tuning, indicating that the approach is both efficient and robust.

In summary, CleanDIFT reframes diffusion models from generative engines into versatile, noise‑free feature extractors. By learning to align clean‑image representations with the full noisy feature distribution, it recovers the rich world knowledge embedded in large diffusion backbones without sacrificing image fidelity or requiring task‑specific hyper‑parameter tuning. The work opens avenues for scaling the technique to larger diffusion models, extending it to video and audio modalities, and integrating it with other self‑supervised paradigms, thereby broadening the practical impact of diffusion‑based representations in computer vision.

