A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical/experimental research and development in all areas of automation
Volume 12, Issue 3, Mar. 2025

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 15.3, Top 1 (SCI Q1)
  • CiteScore: 23.5, Top 2% (Q1)
  • Google Scholar h5-index: 77, Top 5
Citation: J. Liu, X. Li, Z. Wang, Z. Jiang, W. Zhong, W. Fan, and B. Xu, “PromptFusion: Harmonized semantic prompt learning for infrared and visible image fusion,” IEEE/CAA J. Autom. Sinica, vol. 12, no. 3, pp. 502–515, Mar. 2025. doi: 10.1109/JAS.2024.124878

PromptFusion: Harmonized Semantic Prompt Learning for Infrared and Visible Image Fusion

doi: 10.1109/JAS.2024.124878
Funds: This work was partially supported by the China Postdoctoral Science Foundation (2023M730741) and the National Natural Science Foundation of China (U22B2052, 52102432, 52202452, 62372080, 62302078)
  • The goal of infrared and visible image fusion (IVIF) is to integrate the unique advantages of both modalities to achieve a more comprehensive understanding of a scene. However, existing methods struggle to effectively handle modal disparities, resulting in visual degradation of the details and prominent targets of the fused images. To address these challenges, we introduce PromptFusion, a prompt-based approach that harmoniously combines multi-modality images under the guidance of semantic prompts. Firstly, to better characterize the features of different modalities, a contourlet autoencoder is designed to separate and extract the high-/low-frequency components of different modalities, thereby improving the extraction of fine details and textures. We also introduce a prompt learning mechanism using positive and negative prompts, leveraging Vision-Language Models to improve the fusion model’s understanding and identification of targets in multi-modality images, leading to improved performance in downstream tasks. Furthermore, we employ bi-level asymptotic convergence optimization. This approach simplifies the intricate non-singleton non-convex bi-level problem into a series of convergent and differentiable single optimization problems that can be effectively resolved through gradient descent. Our approach advances the state-of-the-art, delivering superior fusion quality and boosting the performance of related downstream tasks. Project page: https://github.com/hey-it-s-me/PromptFusion.
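The positive/negative prompt mechanism described in the abstract can be illustrated with a short sketch. The snippet below uses the OpenAI CLIP package to encode hypothetical positive and negative text prompts and to score a fused image against them; the prompt texts, loss form, and weighting are illustrative assumptions, not the exact formulation used by PromptFusion (the authors' code is linked from the project page).

```python
# Illustrative positive/negative prompt scoring with CLIP (a sketch, not the
# authors' implementation). Assumes the OpenAI CLIP package:
#   pip install git+https://github.com/openai/CLIP.git
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model.eval()
for p in model.parameters():          # freeze CLIP; only the fused image receives gradients
    p.requires_grad_(False)

# Hypothetical prompt sets; PromptFusion's actual prompts may differ.
positive_prompts = ["an image with salient thermal targets and rich visible texture"]
negative_prompts = ["a washed-out image with blurred targets and lost details"]

@torch.no_grad()
def encode_prompts(prompts):
    tokens = clip.tokenize(prompts).to(device)
    feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

pos_feats = encode_prompts(positive_prompts)    # (P, D) unit-norm text features
neg_feats = encode_prompts(negative_prompts)    # (N, D) unit-norm text features

def semantic_prompt_loss(fused):
    """fused: (B, 3, 224, 224) fused images, already resized and CLIP-normalized."""
    img_feats = model.encode_image(fused)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    sim_pos = (img_feats @ pos_feats.T).mean()  # agreement with positive prompts
    sim_neg = (img_feats @ neg_feats.T).mean()  # agreement with negative prompts
    return sim_neg - sim_pos                    # pull toward positives, push away from negatives
```

In a training loop, such a term would be added to the usual intensity/gradient fusion losses so that the fused image is steered toward the semantics expressed by the positive prompts and away from those of the negative prompts.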

     



Figures (13) / Tables (5)

    Article Metrics

    Article views: 283 / PDF downloads: 220

    Highlights

    • To break through the bottleneck of task-oriented fusion, we propose PromptFusion, a semantic-guided fusion method that leverages textual prompts to bridge the semantic gap between modalities, improving machine perception while preserving visual fidelity
    • To capture modality-specific information, we introduce a contourlet autoencoder, a frequency-aware spectral encoder that decomposes and aggregates the low- and high-pass subbands of infrared and visible images to improve multi-modality feature integration
    • For superior downstream task performance, we develop a two-stage prompt learning framework that uses task-specific prompts to constrain the fusion process, accurately distinguishing targets and scenes by learning the typical characteristics of each modality
    • To tackle the challenge of jointly optimizing image fusion and prompt learning, we introduce a bi-level asymptotic convergence optimization method that approximates the complex bi-level problem with a sequence of single-level tasks solved efficiently by gradient descent (see the sketch below)
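As a rough illustration of the last highlight, the sketch below relaxes the bi-level problem into a sequence of single-level objectives that blend a placeholder lower-level fusion loss with a placeholder upper-level task loss under a stage-dependent weight mu_k, each stage solved by ordinary gradient descent. The loss functions, networks, and weight schedule are assumptions made for illustration; the authors' actual relaxation and update rules are given in the paper and repository.

```python
# Minimal sketch of bi-level asymptotic convergence optimization (illustrative only;
# not the authors' exact scheme). The lower-level fusion objective and upper-level
# task objective are blended with a stage-dependent weight mu_k, yielding a sequence
# of single-level problems solvable by plain gradient descent.
import torch
import torch.nn.functional as F

def fusion_loss(fusion_net, ir, vis):
    # Placeholder lower-level objective: keep the fused image close to a max-intensity proxy.
    fused = fusion_net(ir, vis)
    return F.l1_loss(fused, torch.max(ir, vis))

def task_loss(task_head, fusion_net, ir, vis, labels):
    # Placeholder upper-level objective: downstream prediction on the fused image.
    fused = fusion_net(ir, vis)
    return F.cross_entropy(task_head(fused), labels)

def asymptotic_bilevel_train(fusion_net, task_head, loader,
                             stages=5, steps_per_stage=200, lr=1e-4):
    params = list(fusion_net.parameters()) + list(task_head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for k in range(stages):
        mu_k = 1.0 / (k + 1)  # illustrative schedule; shifts the balance between the two levels
        for step, (ir, vis, labels) in zip(range(steps_per_stage), loader):
            loss = task_loss(task_head, fusion_net, ir, vis, labels) \
                   + mu_k * fusion_loss(fusion_net, ir, vis)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return fusion_net, task_head
```

Because each stage is a differentiable single-level problem, the whole procedure avoids the non-convex, non-singleton difficulties of solving the bi-level problem directly while still coupling fusion quality with downstream task performance.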
