A Multimodal Image Fusion Framework Using User-Instruction-Driven Diffusion Transformers
📝 Abstract
Image fusion aims to blend complementary information from multiple sensing modalities, yet existing approaches remain limited in robustness, adaptability, and controllability. Most current fusion networks are tailored to specific tasks and lack the ability to flexibly incorporate user intent, especially in complex scenarios involving low-light degradation, color shifts, or exposure imbalance. Moreover, the absence of ground-truth fused images and the small scale of existing datasets make it difficult to train an end-to-end model that simultaneously understands high-level semantics and performs fine-grained multimodal alignment. We therefore present DiTFuse, an instruction-driven Diffusion-Transformer (DiT) framework that performs end-to-end, semantics-aware fusion within a single model. By jointly encoding two images and natural-language instructions in a shared latent space, DiTFuse enables hierarchical and fine-grained control over fusion dynamics, overcoming the limitations of pre-fusion and post-fusion pipelines that struggle to inject high-level semantics. The training phase employs a multi-degradation masked-image modeling strategy, so the network jointly learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection without relying on ground-truth images. A curated, multi-granularity instruction dataset further equips the model with interactive fusion capabilities. DiTFuse unifies infrared-visible, multi-focus, and multi-exposure fusion, as well as text-controlled refinement and downstream tasks, within a single architecture. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention. The model also supports multi-level user control and zero-shot generalization to other multi-image fusion scenarios, including instruction-conditioned segmentation. The code is available at https://github.com/Henry-Lee-real/DiTFuse.
📄 Content
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021

Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach

Jiayang Li∗, Chengjie Jiang∗, Junjun Jiang†, Senior Member, IEEE, Pengwei Liang, Jiayi Ma, Senior Member, IEEE, and Liqiang Nie, Senior Member, IEEE

Abstract—Image fusion aims to blend complementary information from multiple sensing modalities, yet existing approaches remain limited in robustness, adaptability, and controllability. Most current fusion networks are tailored to specific tasks and lack the ability to flexibly incorporate user intent, especially in complex scenarios involving low-light degradation, color shifts, or exposure imbalance. Moreover, the absence of ground-truth fused images and the small scale of existing datasets make it difficult to train an end-to-end model that simultaneously understands high-level semantics and performs fine-grained multimodal alignment. We therefore present DiTFuse, an instruction-driven Diffusion-Transformer (DiT) framework that performs end-to-end, semantics-aware fusion within a single model. By jointly encoding two images and natural-language instructions in a shared latent space, DiTFuse enables hierarchical and fine-grained control over fusion dynamics, overcoming the limitations of pre-fusion and post-fusion pipelines that struggle to inject high-level semantics. The training phase employs a multi-degradation masked-image modeling strategy, so the network jointly learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection without relying on ground-truth images. A curated, multi-granularity instruction dataset further equips the model with interactive fusion capabilities. DiTFuse unifies infrared-visible, multi-focus, and multi-exposure fusion, as well as text-controlled refinement and downstream tasks, within a single architecture.
Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention. The model also supports multi-level user control and zero-shot generalization to other multi-image fusion scenarios, including instruction-conditioned segmentation. The code is available at https://github.com/Henry-Lee-real/DiTFuse.

Index Terms—Image Fusion, DiT, Text Control

I. INTRODUCTION

Due to the inherent performance bottlenecks of hardware and the complexity of the perceived environment, a single imaging modality can only capture partial information of a natural scene. Image fusion technology integrates complementary information from multiple sources to generate a fused image with more comprehensive information. Depending on the specific task, image fusion techniques mainly include infrared-visible image fusion (IVIF), multi-focus fusion (MFF), and multi-exposure fusion (MEF). These fusion technologies are widely used in mobile photography [1], [2], [3], autonomous driving [4], and medical imaging [5], where they play a crucial role in enhancing scene perception and improving visual effects.

J. Li, P. Liang, and J. Jiang are with the Faculty of Computing, Harbin Institute of Technology, Harbin 150001. E-mail: lijiayang.cs@gmail.com, erfect2020@gmail.com, jiangjunjun@hit.edu.cn. C. Jiang is with Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China. E-mail: 18601580580@163.com. J. Ma is with the Electronic Information School, Wuhan University, Wuhan 430072, China. E-mail: jyma2010@gmail.com. L. Nie is with the School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China. E-mail: nieliqiang@gmail.com. ∗Jiayang Li and Chengjie Jiang contributed equally to this work. †Corresponding author: Junjun Jiang.
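The shared-latent-space conditioning summarized in the abstract can be illustrated with a minimal sketch. Everything below (the patch size, embedding width, the `patchify` helper, and the random projections standing in for learned encoders and a frozen text encoder) is an illustrative assumption, not DiTFuse's actual implementation; it only shows how two source modalities and an instruction can share one token sequence that a transformer attends over jointly.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, patch=8):
    """Split an HxWxC image into flattened non-overlapping patches."""
    h, w, c = img.shape
    img = img.reshape(h // patch, patch, w // patch, patch, c)
    return img.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

# Two input modalities of the same scene (e.g. infrared and visible).
ir = rng.random((32, 32, 1))
vis = rng.random((32, 32, 3))

d_model = 64
# Per-modality linear projections into one shared latent width
# (stand-ins for learned patch-embedding layers).
W_ir = rng.standard_normal((patchify(ir).shape[1], d_model)) * 0.02
W_vis = rng.standard_normal((patchify(vis).shape[1], d_model)) * 0.02

ir_tokens = patchify(ir) @ W_ir      # (16, 64)
vis_tokens = patchify(vis) @ W_vis   # (16, 64)

# Stand-in for a frozen text encoder's output for the instruction.
text_tokens = rng.standard_normal((4, d_model))

# A DiT block would attend over one joint sequence: both images
# plus the instruction, enabling fine-grained cross-modal control.
joint_seq = np.concatenate([ir_tokens, vis_tokens, text_tokens], axis=0)
print(joint_seq.shape)  # (36, 64)
```

Because all three sources live in the same token sequence, self-attention can mix instruction tokens with image tokens directly, which is the property the paper attributes to its shared latent space.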
Although existing methods have demonstrated excellent visual effects in their fusion results, they often fail to significantly improve the accuracy of downstream tasks that use these fused images. To enhance both the visual perceptual quality of the fusion results and the performance of downstream tasks simultaneously, much current work injects high-level semantic information into the fusion network in various ways, making the semantic information of objects in the fused image more prominent. These approaches primarily rely on joint optimization with downstream tasks, injecting high-level semantic information into the fusion network via gradient backpropagation. While these methods can inject semantic information to some extent, they all depend on guidance from external models or complex network designs. In addition, image fusion faces another significant problem: most fusion algorithms are not sufficiently robust for complex scenes. Specifically, if the input images are too dark (or overexposed) or have color casts, these methods often carry these defects directly into the final result. Consequently, some teams have designed specialized fusion networks to handle fusion problems i
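The joint-optimization idea above (injecting semantics via gradient backpropagation from a downstream task) can be reduced to a toy numerical sketch. The linear "networks", the loss weight `lam`, and the learning rate are invented for illustration and bear no relation to the paper's actual objective; the point is only that the task-loss gradient flows through the fusion output into the fusion weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a "fusion network" and a "downstream head", each a
# single linear map so the gradients can be written by hand.
x = rng.random(8)                 # concatenated source features
W_fuse = rng.standard_normal((8, 8)) * 0.1
W_task = rng.standard_normal((8, 1)) * 0.1
y_task = np.array([1.0])          # downstream label (e.g. a class score)

lam = 0.5                         # weight on the semantic (task) loss

for step in range(200):
    fused = W_fuse @ x                       # fusion output
    pred = W_task.T @ fused                  # downstream prediction
    # Joint objective: fidelity to the sources + downstream-task loss.
    l_fid = np.sum((fused - x) ** 2)
    l_task = np.sum((pred - y_task) ** 2)
    # Hand-derived gradients; the task term backpropagates semantic
    # signal THROUGH the fusion network, as in joint optimization.
    g_fused = 2 * (fused - x) + lam * 2 * (pred - y_task) * W_task[:, 0]
    W_fuse -= 0.01 * np.outer(g_fused, x)
    W_task -= 0.01 * lam * np.outer(fused, 2 * (pred - y_task))

print(float(l_fid + lam * l_task))  # joint loss shrinks over training
```

The second term of `g_fused` is the "injected semantics": without it, the fusion weights would only ever see the fidelity signal, which matches the paper's observation that such coupling requires an external task model in the loop.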