A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical/experimental research and development in all areas of automation
Volume 12 Issue 3
Mar. 2025

IEEE/CAA Journal of Automatica Sinica

Citation: J. Liu, X. Li, Z. Wang, Z. Jiang, W. Zhong, W. Fan, and B. Xu, “PromptFusion: Harmonized semantic prompt learning for infrared and visible image fusion,” IEEE/CAA J. Autom. Sinica, vol. 12, no. 3, pp. 502–515, Mar. 2025. doi: 10.1109/JAS.2024.124878

PromptFusion: Harmonized Semantic Prompt Learning for Infrared and Visible Image Fusion

doi: 10.1109/JAS.2024.124878
Funds: This work was partially supported by the China Postdoctoral Science Foundation (2023M730741) and the National Natural Science Foundation of China (U22B2052, 52102432, 52202452, 62372080, 62302078).
  • Infrared and visible image fusion (IVIF) aims to integrate the complementary advantages of both modalities for a more comprehensive understanding of a scene. However, existing methods struggle to handle modal disparities effectively, which degrades the details and prominent targets in the fused images. To address these challenges, we introduce PromptFusion, a prompt-based approach that harmoniously combines multi-modality images under the guidance of semantic prompts. First, to better characterize the features of each modality, we design a contourlet autoencoder that separates and extracts the high- and low-frequency components of each modality, thereby improving the extraction of fine details and textures. Second, we introduce a prompt learning mechanism with positive and negative prompts that leverages vision-language models to improve the fusion model’s understanding and identification of targets in multi-modality images, leading to better performance in downstream tasks. Furthermore, we employ bi-level asymptotic convergence optimization, which simplifies the intricate non-singleton, non-convex bi-level problem into a series of convergent and differentiable single-level optimization problems that can be solved effectively by gradient descent. Our approach advances the state of the art, delivering superior fusion quality and boosting the performance of related downstream tasks. Project page: https://github.com/hey-it-s-me/PromptFusion.
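
The idea of replacing a bi-level problem with a series of differentiable single-level problems can be illustrated on a toy case. The sketch below is not the optimization scheme of the paper: it is a generic unrolled-gradient approximation on two hand-picked quadratic objectives, and every name in it (`solve_bilevel`, the objectives, the step sizes) is an assumption made for illustration. The lower-level problem is approximated by a fixed number of gradient steps, which turns the upper-level objective into an ordinary function of `x` that plain gradient descent can minimize.

```python
def solve_bilevel(inner_steps, outer_steps=500, inner_lr=0.5, outer_lr=0.1):
    """Toy bi-level problem (illustrative only, not the paper's objectives):

        lower level: y*(x) = argmin_y 0.5 * (y - x)^2
        upper level: min_x 0.5 * (y*(x) - 3)^2 + 0.05 * x^2

    y*(x) is replaced by `inner_steps` gradient steps started from y = 0,
    so the upper level becomes a single-level, differentiable problem in x.
    """
    x = 0.0
    for _ in range(outer_steps):
        y, dy_dx = 0.0, 0.0
        for _ in range(inner_steps):
            # Gradient step on the lower-level objective 0.5 * (y - x)^2.
            y_next = y - inner_lr * (y - x)
            # Forward-mode derivative of the same update with respect to x.
            dy_dx = dy_dx - inner_lr * (dy_dx - 1.0)
            y = y_next
        # Chain rule through the unrolled inner loop.
        grad_x = (y - 3.0) * dy_dx + 0.1 * x
        x -= outer_lr * grad_x
    return x


if __name__ == "__main__":
    for steps in (1, 2, 5, 20):
        print(steps, solve_bilevel(steps))
```

For this toy problem the exact bi-level solution is x = 3 / 1.1, and the single-level solutions move toward it as `inner_steps` grows, which is the sense in which a sequence of simpler problems can stand in for the original one.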





    Figures(13)  / Tables(5)



    • To break through the bottleneck of task-oriented fusion, we propose PromptFusion, a semantic-guided fusion method that leverages textual prompts to bridge the semantic gaps between modalities, improving machine perception while preserving visual fidelity.
    • To perceive modality-specific information, we introduce the contourlet autoencoder, a frequency-aware spectral encoder that decomposes and aggregates the low- and high-pass subbands of infrared and visible images to improve multi-modality feature integration.
    • For superior downstream task performance, we develop a two-stage prompt learning framework that uses task-specific prompts to constrain the fusion process, accurately distinguishing targets and scenes by learning the typical characteristics of each modality.
    • To tackle the challenges of jointly optimizing image fusion and prompt learning, we introduce a bi-level asymptotic convergence optimization method that approximates the complex bi-level problem as a series of single-level tasks, which can be solved efficiently by gradient descent.
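
The decomposition into low- and high-pass subbands mentioned in the highlights can be pictured with a much simpler stand-in. The sketch below does not implement a contourlet transform or the learned autoencoder: it assumes a 3x3 mean filter as the low-pass step, takes the residual as the high-pass band, and applies a classic hand-crafted rule (average the low bands, keep the larger-magnitude high band). The function names `box_blur`, `split_bands`, and `fuse` are hypothetical and chosen only for this illustration.

```python
def box_blur(img):
    """3x3 mean filter with edge replication (a crude low-pass filter)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii = min(max(i + di, 0), h - 1)
                    jj = min(max(j + dj, 0), w - 1)
                    total += img[ii][jj]
            out[i][j] = total / 9.0
    return out


def split_bands(img):
    """Split an image into a low-pass band and the high-pass residual."""
    low = box_blur(img)
    high = [[p - l for p, l in zip(row, low_row)]
            for row, low_row in zip(img, low)]
    return low, high


def fuse(ir, vis):
    """Average the low bands; keep the stronger high-band response per pixel."""
    ir_low, ir_high = split_bands(ir)
    vis_low, vis_high = split_bands(vis)
    fused = []
    for i in range(len(ir)):
        row = []
        for j in range(len(ir[0])):
            low = 0.5 * (ir_low[i][j] + vis_low[i][j])
            a, b = ir_high[i][j], vis_high[i][j]
            high = a if abs(a) >= abs(b) else b
            row.append(low + high)
        fused.append(row)
    return fused


if __name__ == "__main__":
    ir = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]   # hot target
    vis = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # flat scene
    for row in fuse(ir, vis):
        print(row)
```

Images here are plain lists of rows of floats with equal shapes. A learned, direction-selective decomposition such as the one described above would replace both the filter and the fixed fusion rule.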

