DeepInv: A Novel Self-supervised Learning Approach for Fast and Accurate Diffusion Inversion

Reading time: 24 minutes

📝 Original Paper Info

- Title: DeepInv: A Novel Self-supervised Learning Approach for Fast and Accurate Diffusion Inversion
- ArXiv ID: 2601.01487
- Date: 2026-01-04
- Authors: Ziyue Zhang, Luxi Lin, Xiaolin Hu, Chao Chang, HuaiXi Wang, Yiyi Zhou, Rongrong Ji

📝 Abstract

Diffusion inversion is the task of recovering the noise of an image in a diffusion model, which is vital for controllable diffusion-based image editing. At present, diffusion inversion remains a challenging task due to the lack of viable supervision signals. Thus, most existing methods resort to approximation-based solutions, which, however, often come at the cost of performance or efficiency. To remedy these shortcomings, we propose a novel self-supervised diffusion inversion approach in this paper, termed Deep Inversion (DeepInv). Instead of requiring ground-truth noise annotations, we introduce a self-supervised objective as well as a data augmentation strategy to generate high-quality pseudo noises from real images without manual intervention. Based on these two innovative designs, DeepInv is also equipped with an iterative and multi-scale training regime to train a parameterized inversion solver, thereby achieving fast and accurate image-to-noise mapping. To the best of our knowledge, this is the first attempt at presenting a trainable solver to predict inversion noise step by step. Extensive experiments show that our DeepInv achieves much better performance and inference speed than the compared methods, e.g., +40.435% SSIM over EasyInv and +9887.5% speed over ReNoise on the COCO dataset. Moreover, our careful designs of trainable solvers can also provide insights to the community. Code and model parameters will be released at https://github.com/potato-kitty/DeepInv.

💡 Summary & Analysis

1. Self-Supervised Diffusion Inversion Approach

    • Simple Explanation: This study introduces a new approach called DeepInv to improve the inversion process in image generation models. It directly predicts noise from images for faster and more accurate results.
    • Metaphor: The inversion process is like reassembling a complex puzzle after it has been disassembled; DeepInv provides a shortcut that makes this task easier.

2. Parameterized Inversion Solvers for SD3 and Flux

    • Simple Explanation: This research proposes specially designed inversion methods for two important image generation models, SD3 and Flux, leading to significantly faster and more accurate results than existing methods.
    • Metaphor: It’s like having a specialized toolkit that can be adjusted for different projects. Each model has its own unique tools which make the task more effective.

3. Efficient Self-Supervised Learning Method

    • Simple Explanation: The study proposes an innovative self-supervised learning method to improve the efficiency of the inversion process, allowing models to learn and improve autonomously.
    • Metaphor: This approach is like a smart student who can automatically identify problems and find solutions. With this ability, tasks are completed faster and more effectively.

📄 Full Paper Content (ArXiv Source)

Figure 1: Illustrations of our DeepInv and previous inversion strategies. (a) Iterative optimization methods update the inversion noise at each timestep but require excessive time. (b) Ordinary differential equation (ODE) based approaches design invertible samplers with better efficiency, but often at the cost of reconstruction quality. (c) DeepInv is the first approach to train a step-by-step inversion solver to quickly and accurately predict inversion noise.
Figure 2: Comparison between our DeepInv and existing SOTA methods, i.e., EasyInv, FTEdit and DVRF, in terms of image inversion (a) and editing (b). (a) visualizes image inversions, showing that our DeepInv preserves the details of the original images better than EasyInv, e.g., the textures and structure. (b) shows image editing results according to the text prompt. The inverted noise predicted by DeepInv helps FTEdit achieve better editing results that are well aligned with the text prompt, while the compared methods often fail to align.

Introduction

Recent years have witnessed great breakthroughs made by image diffusion models in the field of image generation. Representative diffusion models, such as the Stable Diffusion series and DALL-E 3, exhibit far superior capabilities to previous generative approaches. In diffusion-based image editing, a critical application of diffusion models, diffusion inversion is a research hot-spot attracting increasing attention from both academia and industry. This task aims to establish a mapping between the noise space and real images, based on which diffusion models can achieve accurate and controllable image generation or manipulation. Despite this progress, existing diffusion inversion methods still struggle to achieve satisfactory results in both performance and efficiency simultaneously due to the lack of supervision. In particular, a key challenge of diffusion inversion is the absence of ground-truth annotations, which are intractable to obtain. Consequently, conventional supervised learning paradigms for image inversion are hard to implement, and researchers have had to shift towards approximation-based strategies, such as iterative optimization and ordinary differential equation (ODE) based ones, as shown in Fig. 1. While these efforts have achieved remarkable progress, they still encounter several key issues, such as computational complexity, inversion instability, or low reconstruction quality. As shown in Fig. 2, although the SOTA method EasyInv can greatly shorten the reconstruction process, it still struggles with images that contain complex textures or structural details. Besides, on downstream tasks, the inversion-free approach DVRF produces inconsistencies across non-edited areas, and FTEdit, an inversion-based method, generates a monkey with four eyes. Overall, achieving fast, stable, and high-quality diffusion inversion remains an open problem.

To overcome these challenges, we propose a novel self-supervised diffusion inversion approach in this paper, termed Deep Inversion (DeepInv). The main principle of DeepInv is to explore effective self-supervision signals to directly train a parameterized solver, thereby achieving efficient and effective diffusion inversion. To approach this target, DeepInv first resorts to fixed-point iteration theory to automatically derive pseudo noise annotations from real images, a procedure that requires no manual intervention. Besides, a novel data augmentation strategy based on linear interpolation is also proposed to fully exploit the pseudo information from limited real images, and these high-quality pseudo labels are used as the self-supervision signals. Based on the above designs, DeepInv further introduces a novel training regime that integrates pseudo label generation and self-supervised learning in one unified framework, where a multi-scale learning principle is also adopted to progressively improve the solver’s capability for diffusion inversion. With these innovative designs, DeepInv can effectively train a parameterized diffusion solver to directly predict the noise of given images, improving efficiency and performance by orders of magnitude. Moreover, we also carefully design two trainable solver networks for SD3 and Flux, respectively, which can also enlighten the research of the community.

To validate DeepInv, we conduct extensive experiments on the COCO and PIE-Bench benchmarks. The experimental results show comprehensive improvements of DeepInv over existing methods in terms of inversion efficiency and quality, e.g., +200% SSIM over ReNoise on the COCO dataset with a 9887.5% speedup. Besides, in-depth analyses as well as downstream task performance further confirm the merits of DeepInv.

Overall, our contributions are two-fold:

  • We present the first self-supervision-based approach for diffusion inversion, termed DeepInv, which helps train parameterized inversion solvers for fast and accurate mapping of images to inversion noise.

  • Based on DeepInv, we propose the first two parameterized inversion solvers, for SD3.5 and FLUX respectively, yielding comprehensive improvements over existing inversion methods while providing insights to the community.

Figure 3: Overview of our proposed training framework. We begin by initializing a base network of the inversion solver. The training process proceeds through 4 iterative stages. In each iteration, the solver is trained under 5 different timestep configurations, and for each configuration, 2 rounds of optimization are performed using distinct loss functions. In the second-to-last iteration, additional layers are appended and trained while the previously learned parameters are frozen. In the last iteration, all parameters are jointly fine-tuned with a learning rate reduced to 10% of the original. Further implementation details are provided in the Method section.

Related Works

Diffusion models advance generative modeling by iteratively reversing the noise corruption process. DDPM formalizes this with stable training but incurs high sampling costs. DDIM later introduces a deterministic formulation that reduces sampling steps while maintaining quality. Recently, a series of novel designs has introduced the next generation of diffusion models, leading to further improvements in performance. Another key innovation in diffusion models is the introduction of rectified flow, which proposes a novel training paradigm to align the denoising directions across timesteps. By encouraging consistent noise prediction trajectories, this approach enables diffusion models to perform inference with fewer steps while maintaining output quality, thereby improving efficiency. However, these changes create a need for dedicated inversion methods specifically tailored to rectified-flow-based models, so that efficient and high-fidelity generation remains possible in modern diffusion architectures.

In particular, DDIM Inversion represents one of the first attempts, introducing a reverse denoising mechanism to recover the noise of given images while preserving structural coherence. Null-Text Inversion enhances reconstruction quality by decoupling conditional and unconditional textual embeddings, suggesting that empty prompts can have a positive impact on the inversion task. Another approach, PTI, refines prompt embeddings across denoising steps to achieve better alignment. ReNoise applies cyclic optimization to iteratively refine noise estimates, while DirectInv introduces latent trajectory regularization to mitigate inversion drift. More recently, Rectified Flow reshapes the inversion landscape by enforcing consistent noise trajectories across timesteps, achieving high-quality generation with significantly reduced inference costs. This technique is adopted by advanced architectures such as Flux and SD3, prompting the development of new inversion methods. RF-Inversion and RF-Solver both focus on solving the rectified-flow ODE to invert the denoising process, each proposing distinct solutions within this framework.

Text-guided editing methods align textual and visual features for controlled generation. Prompt-to-Prompt manipulates attention maps for attribute edits but requires full denoising. Later works such as Plug-and-Play and TurboEdit improve efficiency through latent blending and time-shift calibration, with TurboEdit also introducing guidance scaling to enhance edit strength without compromising background consistency. However, many of these methods are not publicly available or require extensive training resources. In contrast, our method provides a lightweight and open-source alternative with competitive performance and efficiency.

Preliminary

Currently, diffusion inversion is often regarded as a fixed-point problem in existing research. For a given function $`f(x)`$, the general fixed-point equation is

\begin{equation}
\label{eq:fixed}
x_0 = f(x_0),
\end{equation}

where $`x_0`$ denotes the fixed point of $`f(x)`$. For diffusion inversion, we aim to find a mapping $`g(\cdot)`$ that transforms the latent code at timestep $`t`$ to the one at timestep $`t+1`$:

\begin{equation}
\label{eq:inv}
z_{t+1} = g(z_t).
\end{equation}

Given an optimal latent code $`\bar z_t`$, its inversion $`\bar{z}_{t+1} = g(\bar{z}_t)`$ should obviously satisfy the add-noise-denoising consistency condition, which means a good-quality inversion noise should be consistent with the denoising one at the same timestep:

\begin{equation}
\label{eq:deno}
\bar{z}_{t} = d(g(\bar{z}_{t})),
\end{equation}

where $`d`$ represents the denoising process.

Let $`F = d \circ g`$ denote the composite function; then Eq. [eq:deno] can be transformed into a fixed-point form:

\begin{equation}
\label{eq:fixed_inv}
\bar{z}_{t} = F(\bar{z}_{t}).
\end{equation}

Here, Eq. [eq:fixed_inv] is the guiding principle we use for inversion tasks, laying the theoretical foundation for DeepInv.
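To make the fixed-point view concrete, the sketch below runs the naive iteration $`z \leftarrow F(z)`$ from Eq. [eq:fixed_inv]. It is a minimal illustration under assumed interfaces, not the paper's implementation: `g` and `d` are stand-in callables for one inversion step and one denoising step, and the tolerance and toy maps are our own choices.

```python
import numpy as np

def fixed_point_inversion(z_init, g, d, max_iters=100, tol=1e-8):
    """Solve z = F(z) with F = d o g by naive fixed-point iteration.

    g: inversion step, mapping the latent at timestep t to timestep t+1.
    d: denoising step, mapping the latent at timestep t+1 back to t.
    A fixed point satisfies the add-noise/denoise consistency of Eq. [eq:deno].
    """
    z = np.asarray(z_init, dtype=float)
    for _ in range(max_iters):
        z_next = d(g(z))                      # F(z) = d(g(z))
        if np.linalg.norm(z_next - z) < tol:  # converged to the fixed point
            break
        z = z_next
    return z

# Toy check: here F(z) = 0.5 * z is a contraction, so iteration converges to 0.
A = 0.5 * np.eye(4)
g = lambda z: A @ z + 0.1                        # stand-in inversion map
d = lambda z: 0.5 * np.linalg.solve(A, z - 0.1)  # imperfect stand-in denoiser
z_star = fixed_point_inversion(np.random.randn(4), g, d)
```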

Figure 4: Architecture of the proposed DeepInv solver. The dual-branch design processes noise (left) and image (right) information separately before final aggregation. Extra blocks are added in the second-to-last round to extend the model's capacity.

Method

DeepInv

In this paper, we present a novel self-supervised learning regime for diffusion inversion, termed DeepInv, as depicted in Fig. 3. To enable the effective training of parameterized inversion solvers, DeepInv introduces a novel self-supervised learning design that accounts for temporal consistency, progressive refinement, and multi-scale training.

Overview

DeepInv operates on latent representations derived from a pretrained diffusion model. Given an input image $`I`$, we first encode it into a latent code $`z_0^*`$ through the diffusion model’s original VAE:

\begin{equation}
\label{eq:vae}
z_0^* = \mathrm{VAE}(I).
\end{equation}

Starting from $`z_0^*`$, the parameterized solver $`g^*(\cdot)`$ predicts the inverse noise step by step:

\begin{equation}
\label{eq:in-out}
g^*(z_t^*, t) = \varepsilon^*_t,
\end{equation}

where $`\varepsilon^*_t`$ denotes the predicted inversion noise. It is added to the latent $`z_t^*`$ by the diffusion model’s original sampler $`s`$:

\begin{equation}
\label{eq:sampler}
s(z_t^*, -\varepsilon^*_t) = z_{t+1}^*.
\end{equation}

Here, we treat $`-\varepsilon^*_t`$ as the opposite noise, since diffusion inversion is a process of adding noise.
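Putting Eqs. [eq:vae] through [eq:sampler] together, inference with a trained solver reduces to a short loop. The sketch below is schematic: `vae_encode`, `solver`, and `sampler_step` are hypothetical stand-ins for the pretrained VAE, the parameterized solver $`g^*(\cdot)`$, and the diffusion model's original sampler $`s`$.

```python
import torch

@torch.no_grad()
def invert_image(image, vae_encode, solver, sampler_step, num_steps=50):
    """Step-by-step inversion: image -> clean latent z_0* -> noisy latent.

    vae_encode(image)          -> z_0*        (Eq. [eq:vae])
    solver(z_t, t)             -> eps_t*      (Eq. [eq:in-out])
    sampler_step(z_t, -eps, t) -> z_{t+1}*    (Eq. [eq:sampler])
    """
    z = vae_encode(image)                # z_0*
    for t in range(num_steps):
        eps = solver(z, t)               # predicted inversion noise eps_t*
        z = sampler_step(z, -eps, t)     # negate: inversion adds noise
    return z                             # fully inverted latent
```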

Thus, the solver outputs temporally consistent estimations that can reconstruct the underlying clean latent $`z_0^*`$. The overall objective of DeepInv is therefore to enable correct and temporally coherent inversion across arbitrary diffusion models. Formally, the solver seeks to minimize the discrepancy between predicted noise/latents and their counterparts derived from the forward process, thereby approximating the true inverse dynamics. This objective can be formulated as

\begin{equation}
\label{eq:obj_inversion}
\min_{\phi}~\mathbb{E}\Big[\Vert g_{\phi}^*(z_t^*, t) - d^*(z_{t+1}^*, t+1)\Vert_2^2\Big].
\end{equation}

As mentioned above, an ideal inversion is achieved when the inverted noise is exactly the same as the denoised one at the same timestep, which is precisely what this objective function targets. Considering Eq. [eq:obj_inversion] as the self-supervised objective, the corresponding loss function of DeepInv is defined by

\begin{equation}
\label{eq:l2_loss}
\mathcal{L}_{self} = \Vert\varepsilon_t^*-\bar\varepsilon_t\Vert_2^2.
\end{equation}

In particular, the reference noise $`\bar\varepsilon_t`$ is obtained by

\begin{equation}
\label{eq:denoising}
\bar\varepsilon_t = d^*(z^*_{t+1},t+1),
\end{equation}

where $`d^*`$ represents the pretrained diffusion model, and $`z^*_{t+1}`$ is generated by adding the noise $`\varepsilon_t^*`$ to $`z^*_t`$. With Eq. [eq:l2_loss], our DeepInv can optimize the parameterized solver without ground truth.

Besides, a hybrid loss is also used to incorporate pseudo labels generated from denoising-inversion fusion:

\begin{equation}
\label{eq:loss_mx}
\mathcal{L}_{hyb} = \Vert\varepsilon_t^*-\bar{\varepsilon}_t^*\Vert_2^2,
\end{equation}

where $`\bar{\varepsilon}_t^*`$ represents the fused pseudo noise, which will be detailed later. During the multi-scale fine-tuning stage, DeepInv employs a stabilized loss function defined by

\begin{equation}
\label{eq:loss_new}
\mathcal{L}_{stable} = \alpha\cdot\mathcal{L}_{self} + (1 - \alpha)\cdot\mathcal{L}_{hyb},
\end{equation}

where $`\alpha`$ is a hyper-parameter that balances optimization smoothness against over-fitting.
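The three losses in Eqs. [eq:l2_loss], [eq:loss_mx], and [eq:loss_new] combine as a simple weighted MSE; a minimal sketch follows (the default $`\alpha`$ here is illustrative, not the released value):

```python
import torch.nn.functional as F

def deepinv_losses(eps_pred, eps_ref, eps_fused, alpha=0.5):
    """Compute the self-supervised, hybrid, and stabilized DeepInv losses.

    eps_pred : eps_t*      -- solver prediction
    eps_ref  : bar eps_t   -- noise from denoising z_{t+1}* (Eq. [eq:denoising])
    eps_fused: bar eps_t*  -- fused pseudo noise (Eq. [eq:add_noise])

    Note: mse_loss averages over elements; the paper's squared L2 norm
    differs only by a constant factor.
    """
    l_self = F.mse_loss(eps_pred, eps_ref)            # Eq. [eq:l2_loss]
    l_hyb = F.mse_loss(eps_pred, eps_fused)           # Eq. [eq:loss_mx]
    l_stable = alpha * l_self + (1 - alpha) * l_hyb   # Eq. [eq:loss_new]
    return l_self, l_hyb, l_stable
```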

| Method | LPIPS ($`\downarrow`$) | SSIM ($`\uparrow`$) | PSNR ($`\uparrow`$) | MSE ($`\downarrow`$) | FID ($`\downarrow`$) | Time ($`\downarrow`$) |
| --- | --- | --- | --- | --- | --- | --- |
| DDIM Inversion | 0.452 | 0.501 | 12.936 | 0.059 | 187.528 | 34s |
| RF Inversion | 0.380 | 0.507 | 16.105 | 0.027 | 180.372 | 34s |
| ReNoise | 0.614 | 0.451 | 12.152 | 0.071 | 250.882 | 4746s |
| EasyInv | 0.294 | 0.643 | 18.576 | 0.021 | 153.333 | 34s |
| DeepInv Solver (Ours) | 0.075 | 0.903 | 29.634 | 0.001 | 37.879 | 48s |

Comparison between our DeepInv Solver and existing inversion methods. The base model used is SD3.

Pseudo Label Generation

To further enhance inversion quality, we adopt a pseudo annotation strategy inspired by Null-Text Inversion . Instead of optimizing text embeddings via gradient descent, we directly fuse the denoising and inversion noise predictions, formulated as

\begin{equation}
\label{eq:add_noise}
\bar{\varepsilon}_t^* =
\begin{cases}
\lambda_1 \cdot\bar{\varepsilon}_t + \lambda_2\cdot\varepsilon_t^*, & t \in ((1 - k)\cdot T, T] \\[4pt]
\bar{\varepsilon}_t, & t \in [0, (1 - k)\cdot T].
\end{cases}
\end{equation}

Here, $`\bar{\varepsilon}_t`$ denotes the noise predicted by the diffusion model, which serves as the theoretical target for our inversion solver. However, in practice, $`\bar{\varepsilon}_t`$ tends to accumulate errors during the denoising process, leading to inaccurate estimations. To address this issue, we incorporate the inversion noise $`\varepsilon^*_t`$ generated by our solver $`g^*(\cdot)`$, combining it with $`\bar{\varepsilon}_t`$ for more stable learning. As demonstrated in Null-Text Inversion, aligning the predicted denoising noise with the inversion noise effectively enhances inversion accuracy and reconstruction fidelity. The coefficients $`\lambda_1`$ and $`\lambda_2`$ are introduced to balance the relative contributions of these two noise components in the optimization objective. This linear fusion captures both high-confidence denoising outputs and transitional inversion predictions, based on which we further construct a synthesized dataset $`\mathcal{Q}`$ that provides stable and high-quality pseudo supervision. Compared to gradient-based optimization, this mechanism achieves equivalent alignment quality with significantly improved computational efficiency.
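The fusion rule in Eq. [eq:add_noise] is a timestep-gated linear interpolation. Below is a minimal sketch, using $`\lambda_1=\lambda_2=0.5`$ from the implementation details; the tensor shapes are assumed:

```python
def fuse_pseudo_noise(eps_denoise, eps_inv, t, T, k, lam1=0.5, lam2=0.5):
    """Timestep-gated fusion of denoising and inversion noise (Eq. [eq:add_noise]).

    For late timesteps t in ((1-k)*T, T], blend the model-predicted noise
    eps_denoise with the solver's inversion noise eps_inv; for earlier
    timesteps, keep the denoising prediction unchanged.
    """
    if t > (1 - k) * T:
        return lam1 * eps_denoise + lam2 * eps_inv
    return eps_denoise
```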

Training Procedure

In terms of training process, DeepInv adopts an iterative and multi-scale training regime to progressively refine inversion accuracy and temporal coherence, as depicted in Fig. 3.

Iterative Learning. The training process is divided into multiple temporal stages $`\mathcal{T} = \{1, 5, 10, 25, 50\}`$, where the numbers denote the number of timesteps for each round of training. Each stage captures a distinct temporal resolution, viewing the diffusion process as a vector field. In DeepInv, low-resolution training serves to capture global trajectory patterns, while higher resolutions focus on local details. The training begins with a low-resolution warm-up ($`\mathcal{T}_0 = 1`$) and proceeds by progressively increasing the temporal resolution, with parameters inherited between stages. The final stage ($`\mathcal{T}_4 = 50`$) aims to obtain full temporal coherence. In practice, $`\mathcal{T}`$ can be changed; e.g., FLUX-Kontext is designed to generate images in fewer steps, each of which takes a long time. In this case, $`\mathcal{T}`$ could be set to $`\{1, 5, 10\}`$.

Multi-scale Tuning. To balance efficiency and capacity, the model depth scales with the value of $`k`$. A 5-layer configuration is used for $`k \in \{0.8, 0.6\}`$, and the depth increases to 9 layers when $`k=0.5`$. As illustrated in Fig. 4, new layers are appended to the right branch with residual connections for stability. Fine-tuning proceeds in two iterations: (1) only the newly added layers are optimized, while the others remain frozen; (2) all parameters are fine-tuned with a reduced learning rate (10% of the original). This progressive scheme allows DeepInv to preserve previously learned inversion knowledge while continuously improving reconstruction fidelity across scales, as sketched below.
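The schedule described above can be summarized in a loose schematic. This is a sketch only: the stage list, the $`k`$ schedule, layer growth on the right branch, freezing, and the 10% fine-tuning rate follow the text, while `grow_right_branch`, `new_parameters`, and `run_stage` are hypothetical helpers.

```python
import torch

STAGES = [1, 5, 10, 25, 50]         # temporal resolutions T (timesteps per round)
K_SCHEDULE = [0.8, 0.6, 0.5, 0.5]   # k value for each training iteration

def train_deepinv(solver, run_stage, base_lr=1e-4):
    """Loose schematic of the iterative, multi-scale training regime (Fig. 3)."""
    for i, k in enumerate(K_SCHEDULE):
        if k == 0.5 and getattr(solver, "num_layers", 5) == 5:
            solver.grow_right_branch(to_layers=9)   # extra blocks, right branch only
        final = i == len(K_SCHEDULE) - 1
        # Earlier iterations: optimize only the newly added layers, freeze the rest
        # (in the first iteration this is simply the base network).
        # Final iteration: jointly fine-tune everything at 10% of the base LR.
        params = solver.parameters() if final else solver.new_parameters()
        opt = torch.optim.Adam(params, lr=0.1 * base_lr if final else base_lr)
        for T in STAGES:                            # progressively finer timesteps
            run_stage(solver, opt, num_timesteps=T, k=k)
```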

DeepInv Solver

Based on DeepInv, we further propose innovative parameterized solvers for existing diffusion models, which aim to exploit pre-trained diffusion knowledge and image-specific cues through a dual-branch architecture. Here, we use Stable Diffusion 3 as the base model to show the construction of DeepInv’s solver. The solver for Flux, which follows the same principle, is described in the appendix due to the page limit. Concretely, the DeepInv Solver is built from the same components as SD3, with additional refinement modules for noise correction, as shown in Fig. 4. This design retains the information of the DDIM inversion noise while improving accuracy through residuals.

Input of DeepInv Solver. The diffusion model takes the latent $`z_t^*`$, timestep $`t`$, and prompt $`\omega`$ as inputs to predict the corresponding noise. For our inversion solver $`g^*(\cdot)`$, to achieve higher reconstruction fidelity, we additionally introduce the DDIM inversion noise $`\tilde{\varepsilon}_t`$ as an auxiliary input:

\begin{equation}
\tilde{\varepsilon}_t = \tilde{d}(z_t^*, t),
\end{equation}

where $`\tilde{d}`$ denotes the DDIM inversion operator, and $`z_t^*`$ represents the denoised latent at timestep $`t`$. By incorporating the DDIM inversion noise $`\tilde{\varepsilon}_t`$, which serves as a well-established baseline in inversion, the solver gains a strong initialization prior. Moreover, residual connections are applied between the output of our solver and $`\tilde{\varepsilon}_t`$, as well as between the modules, ensuring that the final performance will not fall below the baseline.

Dual-Branch Design. A key innovation of the DeepInv Solver lies in its dual-branch architecture, which separates pretrained prior modeling (left branch) from image-conditioned refinement (right branch). This structural disentanglement enables a balanced trade-off between inversion fidelity and structural consistency, allowing the model to achieve high-quality reconstruction without requiring explicit ground-truth noise supervision.

Among the four types of inputs to our solver, the latent $`z_t^*`$ is fed into the right branch to facilitate image-guided refinement, complementary to the left branch, as shown in Fig. 4. The prompt embedding $`\omega`$ is sent to the left branch but is typically set to an empty token, following the strategy of Null-Text Inversion, which shows that empty-prompt conditioning enhances inversion fidelity. Meanwhile, the DDIM inversion noise $`\tilde{\varepsilon}_t`$ serves as a shared input to both branches, providing a high-quality prior across the model.

Another novel design is the use of two distinct timestep embeddings, reflecting the respective branch functionalities. The left-branch embedding $`t_1`$ follows the original SD3 temporal encoding, constructed from SD3’s temporal embedding module together with $`\omega`$. It also retains full compatibility with the pretrained model, as shown in Fig. 4. For the right branch, the timestep embedding $`t_2`$ is defined by

\begin{equation}
t_2 = \mathrm{TEMB}(z_0^*, t),
\end{equation}

where $`z_0^*`$ is used in place of the prompt embedding to preserve visual coherence and retain the original image information during the inversion process. Both branches consist of stacked MM-DiT blocks, and the conditional vectors $`t_1`$ and $`t_2`$ remain consistent in their respective branches. Residual connections are employed both between the blocks and at the final output, which retains the prior information from DDIM inversion and enables progressive refinement through the layers. After feature extraction, the two branches are fused through an MM-DiT aggregation block followed by a linear projection layer to generate the predicted noise. Finally, residual addition is applied between the predicted noise and $`\tilde{\varepsilon}_t`$, yielding the final output; a schematic sketch is given below. Details of the Flux solver are provided in the appendix.
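The following is a toy schematic of the dual-branch forward pass: the real solver stacks SD3's MM-DiT blocks with prompt and timestep conditioning, which we abstract here as plain linear blocks, with the (empty) prompt folded into the conditioning embeddings `t1_emb` and `t2_emb`.

```python
import torch
import torch.nn as nn

class DeepInvSolverSketch(nn.Module):
    """Toy dual-branch solver (cf. Fig. 4); all blocks are stand-in modules."""

    def __init__(self, dim=64, depth=4):
        super().__init__()
        self.left = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))   # prior branch
        self.right = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))  # image branch
        self.fuse = nn.Linear(2 * dim, dim)  # stands in for aggregation block + projection

    def forward(self, z_t, eps_ddim, t1_emb, t2_emb):
        h_l = eps_ddim + t1_emb          # left: DDIM noise prior + SD3-style conditioning
        h_r = z_t + eps_ddim + t2_emb    # right: image latent + shared DDIM noise input
        for blk_l, blk_r in zip(self.left, self.right):
            h_l = h_l + blk_l(h_l)       # residual connections between blocks
            h_r = h_r + blk_r(h_r)
        out = self.fuse(torch.cat([h_l, h_r], dim=-1))
        return eps_ddim + out            # final residual over the DDIM prior

# Shape check with random tensors.
m = DeepInvSolverSketch()
eps = m(torch.randn(2, 64), torch.randn(2, 64), torch.randn(2, 64), torch.randn(2, 64))
```

The final residual addition mirrors the design intent described above: even an untrained refinement path leaves the DDIM baseline prediction intact.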

| Method | LPIPS ($`\downarrow`$) | SSIM ($`\uparrow`$) | PSNR ($`\uparrow`$) | MSE ($`\downarrow`$) | FID ($`\downarrow`$) |
| --- | --- | --- | --- | --- | --- |
| FTEdit | 0.078 | 0.90 | 25.117 | 0.004 | 44.423 |
| FTEdit + DeepInv Solver | 0.087 | 0.90 | 26.255 | 0.003 | 41.158 |
| RF Inversion | 0.211 | 0.71 | 19.855 | 0.014 | 67.787 |
| RF Inversion + DeepInv Solver | 0.111 | 0.86 | 24.519 | 0.005 | 54.056 |
| DVRF | 0.093 | 0.85 | 23.372 | 0.007 | 56.698 |

Comparison on downstream editing tasks, with and without the DeepInv Solver.

| Method (+ Noise Interpolation) | $`k`$ | LPIPS ($`\downarrow`$) | SSIM ($`\uparrow`$) | PSNR ($`\uparrow`$) | MSE ($`\downarrow`$) | FID ($`\downarrow`$) |
| --- | --- | --- | --- | --- | --- | --- |
| DDIM Inversion | 0.5 | 0.245 | 0.785 | 23.347 | 0.005 | 108.718 |
| EasyInv | 0.5 | 0.192 | 0.745 | 25.042 | 0.004 | 103.709 |
| DeepInv Solver (Ours) | 0.5 | 0.075 | 0.903 | 29.634 | 0.001 | 37.879 |

The impact of the noise interpolation strategy for different inversion methods. With the same strategy, our approach performs the best. The base model used is SD3.

| Layers | Branch | LPIPS ($`\downarrow`$) | SSIM ($`\uparrow`$) | PSNR ($`\uparrow`$) | MSE ($`\downarrow`$) | FID ($`\downarrow`$) |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | None | 0.076 | 0.900 | 28.652 | 0.002 | 38.106 |
| 9 | Both | 0.076 | 0.903 | 29.563 | 0.001 | 38.188 |
| 9 | Right | 0.075 | 0.903 | 29.634 | 0.001 | 37.879 |

Ablation study on the number of layers added to the DeepInv Solver. The base model used is SD3.

Experiments

We validate the effectiveness of our method through comprehensive experiments on the COCO and PIE-Bench benchmarks, and compare it with a set of advanced diffusion inversion methods as well as the baseline method, including DDIM inversion , EasyInv , ReNoise  and RF-Inversion .

Experimental Settings

Implementation details

To ensure a fair comparison, each inversion method is first applied to estimate the noise corresponding to the input images. The resulting noise is then fed into the SD3 model to reconstruct the images, which are subsequently used for evaluation. We run each method once. Experiments are conducted on three NVIDIA GeForce RTX 3090 GPUs, with 24GB of usable memory each. Since some of these methods are not public or do not support SD3 inversion, we re-implement them for the following experiments. For hyper-parameters, we set $`\lambda_1=\lambda_2=0.5`$ and $`k\in [0.8,0.6,0.5,0.5]`$. Our batch size is set to 4 due to the limited GPU memory. We also use a dynamic training epoch setting, $`epoch \in [300,300,250,200,100]`$, corresponding to the temporal stages $`\mathcal{T}`$. The feature dimension of the MM-DiT blocks in our solver is the same as the original setting in the SD3 model.

Datasets and metrics

We use two mainstream benchmarks. COCO is a large-scale image collection originally created for object detection, segmentation, and captioning: it contains over 300,000 images across 80 object categories and more than 2.5 million labeled instances. PIE-Bench is a prompt-driven image editing dataset comprising 700 images paired with editing instructions across diverse scene types and editing operations (e.g., object addition, removal, attribute modification), designed to evaluate text-driven image editing performance. For the self-supervised training of DeepInv, we also create a dataset of real images from COCO, selecting samples with near-square aspect ratios to ensure compatibility with all inversion frameworks; a sketch of this preprocessing is given below. The selected images span a wide range of categories, including animals, objects, and other diverse content, and are resized to a resolution of $`1024 \times 1024`$ pixels. The dataset is split into 2,000 images for training and 298 images for testing, with no overlap between the two sets. Following previous works, we adopt the widely used inversion metrics, including LPIPS, SSIM, PSNR, MSE, and FID.
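The dataset construction amounts to filtering COCO for near-square images and resizing them. A small sketch follows, where the aspect-ratio tolerance, file layout, and helper name are our assumptions beyond the stated 2,000/298 split and the 1024×1024 resolution:

```python
from pathlib import Path
from PIL import Image

def build_inversion_dataset(coco_dir, out_dir, max_images=2298, ar_tol=0.1):
    """Select near-square COCO images and resize them to 1024x1024."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    kept = 0
    for path in sorted(Path(coco_dir).glob("*.jpg")):
        with Image.open(path) as im:
            w, h = im.size
            if abs(w / h - 1.0) > ar_tol:  # keep near-square aspect ratios only
                continue
            im.convert("RGB").resize((1024, 1024)).save(out / path.name)
        kept += 1
        if kept >= max_images:             # 2,000 train + 298 test in the paper
            break
```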

Quantitative Results

Comparison with existing methods. We first compare our DeepInv with several state-of-the-art (SOTA) inversion methods in Tab. [precision_table]. From this table, we first observe that traditional inversion methods such as DDIM Inversion and EasyInv achieve limited reconstruction fidelity, as reflected by the relatively low PSNR and SSIM values. For instance, DDIM Inversion shows significant degradation under accumulated errors, while EasyInv improves upon DDIM but still struggles to maintain fine-grained visual consistency. This can be attributed to their reliance on either handcrafted inversion trajectories or simple noise estimation without self-supervised refinement. In contrast, our DeepInv achieves substantial improvements across all metrics, with a remarkable +129.1% gain in PSNR and +80.2% in SSIM compared to DDIM Inversion, and clear advantages over EasyInv (+75.3% in FID and +74.5% in LPIPS). Moreover, DeepInv demonstrates superior efficiency, e.g., operating 9887.5% faster than the iterative ReNoise method, while requiring only 14 additional seconds compared to DDIM Inversion, which represents the practical speed limit of the inversion task. These experiments confirm the effectiveness of DeepInv in achieving high-fidelity and high-efficiency diffusion inversion.

Comparison on downstream editing tasks. To further evaluate the versatility of DeepInv, we integrate it into two representative diffusion-based editing approaches, i.e., RF Inversion and FTEdit, and report the results in Tab. [downstream]. We also include DVRF, a leading inversion-free editing model, as a reference baseline for end-to-end diffusion editing. The comparison reveals several interesting observations. Firstly, without our solver, inversion-based methods such as RF Inversion and FTEdit often struggle to maintain both structural fidelity and visual coherence, showing inconsistent texture reconstruction and noticeable semantic drift. For example, RF Inversion tends to show significant inconsistency in local details while preserving the global layout, and FTEdit occasionally suffers from unnatural transitions around edited areas. These artifacts mainly arise from imperfect or unstable inversion noise estimation, which directly affects downstream editing quality. In stark contrast, when our DeepInv solver is applied, both RF Inversion and FTEdit demonstrate substantial and consistent improvements across nearly all metrics. The DeepInv solver not only stabilizes the inversion process but also enhances semantic alignment and texture consistency, allowing these methods to outperform the inversion-free DVRF baseline on multiple quantitative indicators. The only exception is FTEdit’s LPIPS score, where the original is slightly better. Overall, these results highlight that DeepInv serves as a powerful and general inversion backbone, i.e., it is capable of elevating diverse diffusion-based editing pipelines by providing more faithful inversions. These results also suggest that the prior performance limitations of inversion-based editing frameworks primarily stem from the lack of a proper inversion mechanism rather than from the editing formulation itself.

Figure 5: Visualized comparison between the DeepInv Solver and existing methods on the tasks of image inversion (left) and editing (right). For the image editing task, we integrate the DeepInv solver into two representative inversion-based diffusion editing methods, i.e., RF-Inv and FTEdit, by replacing their original inversion modules. DVRF is a SOTA inversion-free method. In each example, the object in the original image is marked in red, and the replaced (or added) object is highlighted in blue. The first-row example illustrates an object-addition scenario with only the blue prompts. As shown, the DeepInv solver consistently achieves more faithful inversions and leads to visually coherent edits.

Ablation Studies. We then ablate the key designs of DeepInv in Tab. 1 and Tab. 2 to examine the impact of our noise interpolation module, i.e., Eq. [eq:add_noise]. We conduct a controlled experiment by applying the same interpolation strategy to two strong baselines, i.e., DDIM inversion and EasyInv. As shown in Tab. 1, even under this setup, our method consistently outperforms both baselines by a large margin. This result indicates that while noise interpolation contributes positively, the gains achieved by our framework do not rely solely on this component, but instead reflect the overall effectiveness and robustness of our design. Tab. 2 summarizes the impact of layer extension. We observe that additional layers generally improve performance, confirming the benefit of increased model capacity. However, adding layers to both branches leads to degraded results and increased computational cost. We attribute this to the processing of DDIM-inverted noise in the left branch, which provides high-quality prior information: it requires minimal modification, and excessive complexity may hinder convergence and introduce instability. Thus, the configuration that adds layers only to the right branch proves most effective and becomes our final choice.

Qualitative Results

To further demonstrate the effectiveness of the proposed DeepInv, we visualize its performance on a range of downstream image editing and inversion tasks in Fig. 5, which includes comparisons with both diffusion-based editing pipelines and recent advanced inversion approaches.

In the left part of Fig. 5, we present a comparison between DeepInv and other advanced inversion methods, including DDIM Inversion, EasyInv, and ReNoise. While DDIM Inversion and EasyInv can produce structurally coherent outputs, they often fail to preserve high-frequency visual details. Although capable of reconstructing global layouts, ReNoise tends to introduce color inconsistencies and unnatural tones. In contrast, DeepInv yields reconstructions that remain faithful to the original images in both structure and texture, achieving nearly imperceptible differences between the reconstructed and original visuals. These results confirm that DeepInv provides the most faithful and perceptually realistic inversions among existing methods. In the right part of Fig. 5, we evaluate DeepInv on downstream editing tasks by integrating it into two representative inversion-based diffusion editing methods, namely RF-Inv and FTEdit, replacing their original inversion modules with our DeepInv solver. We also compare against DVRF, a SOTA inversion-free editing method. As shown, the use of DeepInv markedly improves reconstruction fidelity and semantic consistency across both edited and non-edited regions, further confirming the contributions of our work. Compared to the inversion-free DVRF, methods equipped with our solver better preserve fine-grained details and maintain stronger spatial alignment between the edited object and the original background. These qualitative comparisons highlight that DeepInv not only strengthens inversion-based editing pipelines but also ensures superior structural coherence and visual realism across diverse editing scenarios.

Conclusion

In this paper, we present DeepInv, a novel self-supervised framework for diffusion inversion that for the first time enables accurate and efficient inversion through a trained end-to-end solver. Extensive experiments demonstrate that DeepInv outperforms existing inversion methods in both reconstruction quality and computational efficiency. Its integration with existing end-to-end editing methods not only improves output quality but also offers a promising direction for highly controllable diffusion-based real image editing.



A Note of Gratitude

The copyright of this content belongs to the respective researchers. We deeply appreciate their hard work and contribution to the advancement of human civilization.
