Noise-Free Geometric Priors via Timestep-Dependent Visual Experts and Self-Supervised Fusion

Reading time: 5 minutes

📝 Abstract

Although diffusion models with strong visual priors have emerged as powerful dense prediction backbones, they overlook a core limitation: the stochastic noise at the core of diffusion sampling is inherently misaligned with dense prediction, which requires a deterministic mapping from image to geometry. In this paper, we show that this stochastic noise corrupts fine-grained spatial cues and pushes the model toward timestep-specific noise objectives, consequently destroying meaningful geometric structure mappings. To address this, we introduce D³-Predictor, a noise-free deterministic diffusion-based dense prediction model built by reformulating a pretrained diffusion model without stochastic noise. Instead of relying on noisy inputs to leverage diffusion priors, D³-Predictor views the pretrained diffusion network as an ensemble of timestep-dependent visual experts and self-supervisedly aggregates their heterogeneous priors into a single, clean, and complete geometric prior. Meanwhile, we utilize task-specific supervision to seamlessly adapt this noise-free prior to dense prediction tasks. Extensive experiments on various dense prediction tasks demonstrate that D³-Predictor achieves competitive or state-of-the-art performance in diverse scenarios. In addition, it requires less than half the training data previously used and efficiently performs inference in a single step. Our code, data, and checkpoints are publicly available at https://x-gengroup.github.io/HomePage_D3-Predictor/.


📄 Content

Figure 1. We present D³-Predictor, a noise-free deterministic diffusion model that achieves superior performance and generalization across various dense prediction tasks while using less than half the training data of prior methods and performing inference efficiently in a single step.

Although diffusion models with strong visual priors have emerged as powerful dense prediction backbones, they overlook a core limitation: the stochastic noise at the core of diffusion sampling is inherently misaligned with dense prediction, which requires a deterministic mapping from image to geometry. In this paper, we show that this stochastic noise corrupts fine-grained spatial cues and pushes the model toward timestep-specific noise objectives, consequently destroying meaningful geometric structure mappings. To address this, we introduce D³-Predictor, a noise-free deterministic diffusion-based dense prediction model built by reformulating a pretrained diffusion model without stochastic noise. Instead of relying on noisy inputs to leverage diffusion priors, D³-Predictor views the pretrained diffusion network as an ensemble of timestep-dependent visual experts and self-supervisedly aggregates their heterogeneous priors into a single, clean, and complete geometric prior.

Dense prediction tasks, such as depth estimation [50] and surface normal estimation [76], are fundamental in computer vision, with numerous applications including autonomous driving [6,20], scene reconstruction [10,90], and inverse rendering [9,87]. Although state-of-the-art discriminative dense prediction models [4,25] achieve impressive performance, they still struggle to capture fine-grained high-frequency details. To address this limitation, current works [17,31,41,84,92] reformulate dense prediction as an image-conditioned iterative denoising process based on diffusion models [66]. By leveraging the powerful visual priors of diffusion models, these methods can produce dense prediction results with fine-grained geometric details.

While these diffusion-based dense prediction methods have demonstrated promising results, they still suffer from the stochastic noise inherent in diffusion models. Stochastic noise is a fundamental component of diffusion models, enabling sample diversity that is particularly beneficial for creative image [44,58,79] and video [75,83,91] generation. However, dense prediction is intrinsically deterministic, indicating that the stochastic noise essential to diffusion models may be misaligned with the task’s deterministic objective. We posit that the stochastic noise introduces two critical issues for dense prediction tasks: 1) Stochastic noise disrupts the geometric structures and small-scale objects in the input image, thereby degrading the integrity of input information and impeding precise spatial perception (cf. Fig. 2 (a)); 2) The stochastic noise drives diffusion models to focus on modeling noise distributions [24] instead of establishing the geometric structure mappings essential for dense prediction tasks [17]. Moreover, the iterative denoising process of the diffusion model further incurs substantial inference overhead. These observations motivate a fundamental question: Can diffusion models be reformulated into a noise-free deterministic framework to better suit dense prediction tasks?
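The first issue above can be made concrete with the standard DDPM-style forward process: the network input at timestep t is a random mixture of the clean image and Gaussian noise, so two forward passes on the same image see different inputs. The sketch below is a toy illustration (the linear beta schedule and array "image" are assumptions for demonstration, not the paper's actual backbone or schedule):

```python
import numpy as np

# Toy illustration of DDPM-style forward noising, q(x_t | x_0).
# Assumption: a simple linear beta schedule; real diffusion backbones
# (e.g. Stable Diffusion) use different schedules, but the point is the same.
def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)  # abar_t = prod_{s<=t} (1 - beta_s)

def noisy_input(x0, t, alpha_bar, rng):
    """x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = np.ones((8, 8))  # stand-in for a clean image

xa = noisy_input(x0, 500, alpha_bar, rng)
xb = noisy_input(x0, 500, alpha_bar, rng)
# Two draws at the same timestep give different network inputs, so any
# deterministic image-to-geometry mapping must contend with a stochastic input,
# and fine-grained spatial cues in x0 are increasingly drowned out as t grows.
assert not np.allclose(xa, xb)
```

Note also that the signal fraction `sqrt(alpha_bar[t])` decays monotonically with t, which is why small-scale structures are the first to be lost.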

Recent works attempt to employ deterministic noise to alleviate the stochasticity in diffusion-based dense prediction methods. For example, GenPercept [80] and E2E-FT [18] eliminate stochasticity by fixing the noise schedule (a function of the timestep) to introduce deterministic noise. However, diffusion models learn timestep-specific objectives [2,3], leading to distinct diffusion priors at each timestep. Consequently, fixing the noise schedule disrupts this prior structure, resulting in incomplete priors and a loss of geometric fidelity (cf. Fig. 2 (b)). On the other hand, StableNormal [84] suppresses stochasticity while preserving diffusion priors via a more complex two-stage pipeline with external DINO [55] guidance. However, this design incurs a higher computational overhead.
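The "deterministic noise" strategy described above can be sketched as follows. This is a minimal conceptual sketch, not the published code of GenPercept or E2E-FT; the function names and the choice of zeroed noise at a fixed timestep are illustrative assumptions about the general recipe:

```python
import numpy as np

# Sketch of fixing the noise schedule: pin the timestep and replace the
# stochastic eps with a deterministic value (here, zeros). The input becomes
# a fixed rescaling of x_0, so sampling stochasticity disappears, but only
# one of the model's many timestep-specific priors is ever exercised.
def deterministic_input(x0, t_fixed, alpha_bar):
    eps = np.zeros_like(x0)  # deterministic "noise" in place of eps ~ N(0, I)
    return np.sqrt(alpha_bar[t_fixed]) * x0 + np.sqrt(1.0 - alpha_bar[t_fixed]) * eps

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
x0 = np.random.default_rng(0).standard_normal((8, 8))

xt = deterministic_input(x0, 999, alpha_bar)
# The input is now fully determined by x0: repeated calls agree exactly.
assert np.allclose(xt, np.sqrt(alpha_bar[999]) * x0)
```

Because the model was pretrained with a distinct objective at every timestep, conditioning it on a single fixed (timestep, noise) configuration discards the priors learned at all other timesteps, which is the incompleteness the paper points to in Fig. 2 (b).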

In this work, we aim to fully eliminate the adverse effects of stochastic noise on diffusion-based dense prediction, without compromising the diffusion prior and with minimal additional computational cost. To this end, we propose D³-Predictor, a noise-free deterministic diffusion-based dense prediction model initialized from a pretrained diffusion model. Specifically, we treat the pretrained diffusion model at different timesteps as an ensemble of visual experts following the CleanDIFT [70] paradigm, each exhibiting distinct timestep-dependent diffusion priors. In this context, each visual expert takes a noisy image together with its timestep as input, while our D³-Predictor operates directly on a clean image. D³-Predictor then aggregates diffusion priors from multiple visual experts into a complete and noise-free one in a lightweight self-supervised manner, by aligning its internal representations with those of the visual experts. We simultaneously apply task-specific supervision to D³-Predictor to easily leverage this aggregated diffusion prior for dense prediction tasks.
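The training objective described above can be summarized in a toy sketch. Everything here is illustrative: the real experts are a frozen pretrained diffusion U-Net evaluated at different timesteps and the student is the full D³-Predictor network, whereas below both are stand-in closed-form functions; only the shape of the loss (feature alignment across experts plus a task term) reflects the described method:

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_features(x_noisy, t):
    # Stand-in for the frozen pretrained diffusion model at timestep t
    # (a distinct timestep-dependent "visual expert").
    return np.tanh(x_noisy + 0.01 * t)

def student_features(x_clean, w):
    # Stand-in for D3-Predictor: operates on the clean image, no timestep input.
    return np.tanh(w * x_clean)

def alignment_loss(x_clean, w, timesteps, alpha_bar):
    """Self-supervised term: align student features with each expert's
    features on the corresponding noisy input, aggregating the experts'
    heterogeneous timestep-dependent priors into one clean prior."""
    s = student_features(x_clean, w)
    total = 0.0
    for t in timesteps:
        eps = rng.standard_normal(x_clean.shape)
        x_noisy = np.sqrt(alpha_bar[t]) * x_clean + np.sqrt(1.0 - alpha_bar[t]) * eps
        total += np.mean((s - teacher_features(x_noisy, t)) ** 2)
    return total / len(timesteps)

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
x = rng.standard_normal((8, 8))

# In training, the full objective would be this alignment term plus a
# task-specific supervision term (e.g. a depth or normal regression loss).
loss = alignment_loss(x, w=1.0, timesteps=[100, 400, 700], alpha_bar=alpha_bar)
assert loss >= 0.0
```

The design choice worth noting is that only the student ever runs at inference time, on a clean image and in a single pass, which is what yields the single-step, noise-free behavior claimed in the abstract.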

This content is AI-processed based on ArXiv data.
