A joint journal of the IEEE and CAA, publishing high-quality papers in English on original theoretical/experimental research and development in all areas of automation

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 15.3, Top 1 (SCI Q1)
  • CiteScore: 23.5, Top 2% (Q1)
  • Google Scholar h5-index: 77, Top 5
Citation: J. Liu, X. Li, Z. Wang, Z. Jiang, W. Zhong, W. Fan, and B. Xu, “PromptFusion: Harmonized semantic prompt learning for infrared and visible image fusion,” IEEE/CAA J. Autom. Sinica, 2024. doi: 10.1109/JAS.2024.124878

PromptFusion: Harmonized Semantic Prompt Learning for Infrared and Visible Image Fusion

doi: 10.1109/JAS.2024.124878
Funds: This work was partially supported by the China Postdoctoral Science Foundation (2023M730741) and the National Natural Science Foundation of China (52102432, 52202452, 62372080, 62302078).
Abstract: The goal of infrared and visible image fusion (IVIF) is to integrate the complementary strengths of both modalities into a more comprehensive understanding of a scene. However, existing methods struggle to handle modal disparities effectively, resulting in degraded details and weakened salient targets in the fused images. To address these challenges, we introduce PromptFusion, a prompt-based approach that harmoniously combines multi-modality images under the guidance of semantic prompts. First, to better characterize the features of different modalities, a contourlet autoencoder is designed to separate and extract their high- and low-frequency components, improving the extraction of fine details and textures. Second, we introduce a prompt learning mechanism built on positive and negative prompts, leveraging Vision-Language Models to strengthen the fusion model’s understanding and identification of targets in multi-modality images and thereby improve performance on downstream tasks. Furthermore, we employ a bi-level asymptotic convergence optimization that simplifies the intricate non-singleton, non-convex bi-level problem into a series of convergent, differentiable single-level problems that can be solved by gradient descent. Our approach advances the state of the art, delivering superior fusion quality and boosting the performance of related downstream tasks. Project page: https://github.com/hey-it-s-me/PromptFusion
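For readers who want to see the positive/negative prompt idea in concrete form, the following PyTorch sketch shows one way a CLIP-based semantic prompt loss could steer a fusion network toward prompt-relevant content. The class name, prompt texts, and exact loss form are illustrative assumptions rather than the authors' implementation; see the project page above for the released code.

```python
# Minimal sketch (assumption, not the authors' code) of a CLIP-based
# positive/negative semantic prompt loss for guiding image fusion.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git


class SemanticPromptLoss(torch.nn.Module):
    """Pull a fused image toward 'positive' text prompts and push it away
    from 'negative' ones in CLIP's joint image-text embedding space."""

    def __init__(self, pos_prompts, neg_prompts, device="cuda"):
        super().__init__()
        self.model, _ = clip.load("ViT-B/32", device=device)
        self.model = self.model.float()          # fp32 so gradients flow cleanly
        self.model.eval()
        for p in self.model.parameters():        # keep CLIP frozen
            p.requires_grad_(False)
        with torch.no_grad():
            pos = self.model.encode_text(clip.tokenize(pos_prompts).to(device))
            neg = self.model.encode_text(clip.tokenize(neg_prompts).to(device))
        self.pos_feat = pos / pos.norm(dim=-1, keepdim=True)
        self.neg_feat = neg / neg.norm(dim=-1, keepdim=True)

    def forward(self, fused):
        # fused: (B, 3, 224, 224), already resized and CLIP-normalized.
        img = self.model.encode_image(fused)
        img = img / img.norm(dim=-1, keepdim=True)
        pos_sim = (img @ self.pos_feat.t()).mean()   # mean cosine similarity
        neg_sim = (img @ self.neg_feat.t()).mean()
        # Maximize similarity to positive prompts, minimize it to negative ones.
        return (1.0 - pos_sim) + neg_sim


# Illustrative usage with hypothetical prompt texts:
# loss_fn = SemanticPromptLoss(
#     pos_prompts=["a clear image of pedestrians and vehicles at night"],
#     neg_prompts=["a blurry, low-contrast image with washed-out targets"])
# semantic_loss = loss_fn(fused_batch)   # combine with fidelity losses
```

In practice such a semantic term would be weighted against pixel- and gradient-level fidelity losses, and the fused output would be resized and normalized to CLIP's expected input before being scored.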
