CycleDiff: Cycle Diffusion Models for Unpaired Image-to-image Translation

Notice: This research summary and analysis were automatically generated with AI. For complete accuracy, please refer to the original arXiv source.

We introduce a diffusion-based cross-domain image translator that requires no paired training data. Unlike GAN-based methods, our approach integrates diffusion models to learn the image translation process, providing broader coverage of the data distribution and improved cross-domain translation performance. However, embedding the translation process within the diffusion process remains challenging because the two processes are not exactly aligned: the diffusion process operates on the noisy signal, while the translation process operates on the clean signal. As a result, recent diffusion-based studies resort to separate training or shallow integration of the two processes, which can trap the translation optimization in local minima and constrain the effectiveness of the diffusion models. To address this problem, we propose a novel joint learning framework that aligns the diffusion and translation processes, thereby improving global optimality. Specifically, we extract image components with the diffusion models to represent the clean signal and apply the translation process to these components, enabling end-to-end joint learning. In addition, we introduce a time-dependent translation network to learn the complex translation mapping, yielding effective translation learning and significant performance improvement. Benefiting from joint learning, our method enables global optimization of both processes, achieving improved fidelity and structural consistency. We conduct extensive experiments on RGB$\leftrightarrow$RGB and diverse cross-modality translation tasks, including RGB$\leftrightarrow$Edge, RGB$\leftrightarrow$Semantics, and RGB$\leftrightarrow$Depth, demonstrating better generative performance than the state of the art.


💡 Research Summary

CycleDiff tackles the challenging problem of unpaired image‑to‑image translation by jointly learning diffusion models and a cycle‑consistent translator. Traditional GAN‑based approaches treat translation as a single‑step mapping, which often leads to mode collapse and insufficient coverage of the target distribution. Recent diffusion‑based methods either revise the sampling equation or train the diffusion and translation modules separately, resulting in misaligned objectives and sub‑optimal performance.

The core contribution of CycleDiff is two‑fold. First, it introduces the notion of an “image component” – the gradient of the image attenuation process (denoted Cₛₜ for domain S at time step t) predicted by a denoising U‑Net. Unlike the noisy estimates of conventional diffusion models, this component approximates the clean image and can be fed directly into a translation network. Second, it proposes a time‑dependent translation network (Gϕ for S→T and Fψ for T→S) that incorporates the current diffusion time step t as an explicit conditioning signal. By embedding time information into the feature space, the translator learns a distinct mapping for each denoising step, effectively turning the translation into a multi‑step process that aligns with the iterative nature of diffusion.
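The time conditioning described above can be illustrated with a minimal sketch. The paper does not publish this exact code; the sinusoidal embedding, the single linear layer, and all variable names (`time_embedding`, `translate`, `W_x`, `W_t`) are illustrative assumptions standing in for the actual translator architecture — the point is only that the same input component maps differently at different diffusion steps:

```python
import numpy as np

def time_embedding(t, dim=8):
    """Sinusoidal embedding of the diffusion step t (illustrative choice;
    the paper's actual embedding may differ)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def translate(component, t, W_x, W_t, b):
    """Toy time-conditioned translator G_phi: a single affine layer whose
    output depends on both the image component and the step embedding."""
    emb = time_embedding(t)
    return np.tanh(component @ W_x + emb @ W_t + b)

rng = np.random.default_rng(0)
W_x = rng.normal(size=(16, 16))
W_t = rng.normal(size=(8, 16))
b = np.zeros(16)
c = rng.normal(size=16)           # a 16-dim stand-in for an image component
out_early = translate(c, 10, W_x, W_t, b)
out_late = translate(c, 500, W_x, W_t, b)
```

Because the step embedding enters the feature space additively, the translator realizes a distinct mapping per denoising step, matching the multi-step view described above.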

The overall architecture consists of two domain‑specific diffusion models and a shared cycle translator. During training, a noisy sample xₜ from either domain is processed by its diffusion network, which outputs both the predicted noise ϵ̂ and the image component Cₛₜ. The component is then passed through Gϕ (or Fψ) together with the time embedding, producing a translated component in the opposite domain. A second translation brings it back, and a cycle‑consistency loss enforces Cₛₜ ≈ Fψ(Gϕ(Cₛₜ, t), t). In parallel, standard diffusion reconstruction losses (L₂ on the predicted clean image and on the noise) are applied. An optional perceptual loss (e.g., VGG‑based) further encourages high‑frequency fidelity. All losses are summed into a single objective and optimized end‑to‑end, allowing the diffusion and translation modules to co‑adapt.
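The combined objective above can be sketched in a few lines. This is a simplified stand-in, not the paper's implementation: the L1 cycle term, the L2 noise term, the weighting `lam`, and the function names are all assumptions for illustration; the perceptual term is omitted:

```python
import numpy as np

def cycle_loss(C_s, t, G, F):
    """Cycle-consistency on image components: C_s should satisfy
    C_s ~ F(G(C_s, t), t) after a round trip through both translators."""
    recon = F(G(C_s, t), t)
    return np.mean(np.abs(recon - C_s))  # L1 cycle term (illustrative)

def total_loss(C_s, eps_hat, eps, t, G, F, lam=10.0):
    """Single joint objective: diffusion noise-prediction loss plus a
    weighted cycle term, optimized end-to-end."""
    diff = np.mean((eps_hat - eps) ** 2)  # standard noise-prediction L2
    cyc = cycle_loss(C_s, t, G, F)
    return diff + lam * cyc

# Sanity check: identity translators and perfect noise prediction
# drive both terms to zero.
ident = lambda x, t: x
C = np.ones(4)
eps = np.zeros(4)
zero_loss = total_loss(C, eps, eps, 3, ident, ident)
```

Summing the terms into one scalar is what lets gradients flow through both the diffusion networks and the translators, so the two modules co-adapt rather than being trained in isolation.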

Experiments cover four major translation settings: (1) RGB↔RGB (including style swaps such as Cat↔Dog, Summer↔Winter, etc.), (2) RGB↔Edge, (3) RGB↔Semantic segmentation maps, and (4) RGB↔Depth maps. Quantitative metrics (FID, IS, LPIPS, mIoU for semantic tasks, RMSE for depth) consistently show that CycleDiff outperforms state‑of‑the‑art GAN‑based methods (CycleGAN, MUNIT, UNIT) and recent diffusion‑based approaches (EGSDE, CycleDiffusion, UNIT‑DDPM, SynDiff). Notably, on the Dog→Cat and Dog→Wild benchmarks, CycleDiff improves FID by 19.6 and 19.7 points respectively over the second‑best method. Qualitative results demonstrate superior preservation of structure (e.g., pose, layout) while delivering photorealistic textures.

Ablation studies confirm the importance of the two key designs. Removing the image component extraction (i.e., feeding raw noisy images to the translator) degrades performance dramatically, as the translation network receives ambiguous inputs. Replacing the time‑dependent translator with a static one also leads to a significant drop in FID and higher cycle‑consistency loss, highlighting that each diffusion step requires a tailored mapping.

From an efficiency standpoint, CycleDiff retains a parameter count comparable to existing diffusion‑based translators (≈150 M) and can be accelerated using DDIM‑style sampling, reducing inference steps from 1000 to around 200 without sacrificing quality. This makes the method viable for near‑real‑time applications.
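The DDIM-style acceleration mentioned above works by taking deterministic jumps between non-adjacent steps. A minimal sketch of one deterministic (eta = 0) DDIM update, assuming the standard noise-prediction parameterization (this is generic DDIM arithmetic, not code from the paper):

```python
import numpy as np

def ddim_step(x_t, eps_hat, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update: estimate the clean signal x0 from the
    predicted noise, then re-noise it to the (possibly much earlier) target
    step. Skipping steps this way cuts 1000 iterations down to ~200."""
    x0_hat = (x_t - np.sqrt(1 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1 - alpha_bar_prev) * eps_hat

# With a perfectly predicted noise, the update lands exactly on the
# closed-form noisy sample at the earlier step.
rng = np.random.default_rng(1)
x0 = rng.normal(size=5)
eps = rng.normal(size=5)
ab_t, ab_prev = 0.5, 0.8
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)
```

Because the update is deterministic given the noise estimate, the step schedule can be subsampled freely, which is what makes the 1000-to-200 reduction possible without retraining.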

In summary, CycleDiff presents a unified framework that aligns diffusion and translation through image‑component extraction and time‑conditioned multi‑step mapping, achieving global optimality and structural consistency absent in prior work. The approach opens avenues for high‑resolution, multi‑domain, and text‑conditioned translation extensions, positioning diffusion models as a powerful alternative to GANs for unpaired image‑to‑image translation.

