A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical/experimental research and development in all areas of automation.

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 15.3, Top 1 (SCI Q1)
  • CiteScore: 23.5, Top 2% (Q1)
  • Google Scholar h5-index: 77, Top 5
Citation: Y. Yuan, G. Yang, James Z. Wang, H. Zhang, H. Shan, F. Wang, and J. Zhang, “Dissecting and mitigating semantic discrepancy in stable diffusion for image-to-image translation,” IEEE/CAA J. Autom. Sinica, 2024. doi: 10.1109/JAS.2024.124800

Dissecting and Mitigating Semantic Discrepancy in Stable Diffusion for Image-to-Image Translation

doi: 10.1109/JAS.2024.124800
Funds: This work was supported in part by the National Natural Science Foundation of China (No. 62176059). The work of James Z. Wang was supported by The Pennsylvania State University.
  • Finding suitable initial noise that retains the original image’s information is crucial for image-to-image (I2I) translation using text-to-image (T2I) diffusion models. A common approach is to add random noise directly to the original image, as in SDEdit. However, we have observed that this can cause “semantic discrepancy,” wherein T2I diffusion models misinterpret semantic relationships and generate content not present in the original image. We identify that the noise introduced by SDEdit disrupts the semantic integrity of the image, leading to unintended associations between unrelated regions after U-Net upsampling. Building on the widely used latent diffusion model Stable Diffusion, we propose a training-free, plug-and-play method to alleviate semantic discrepancy and enhance the fidelity of the translated image. By leveraging the deterministic nature of denoising diffusion implicit model (DDIM) inversion, we correct the erroneous features and correlations from the original generative process with accurate ones from DDIM inversion. This approach alleviates semantic discrepancy and surpasses recent DDIM-inversion-based methods such as PnP with fewer priors, achieving an 11.2-fold speedup in experiments on the COCO, ImageNet, and ImageNet-R datasets across multiple I2I translation tasks. The code is available at https://github.com/Sherlockyyf/Semantic_Discrepancy.
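The abstract contrasts two ways of obtaining the initial noisy latent: SDEdit’s stochastic noising of the clean latent versus deterministic DDIM inversion. The following is a minimal sketch, not the authors’ released implementation: `eps_model` is a hypothetical stand-in for the Stable Diffusion U-Net noise predictor (which in practice is also text-conditioned), and a standard linear beta schedule is assumed.

```python
# Illustrative sketch only: contrasts SDEdit-style stochastic noising with
# deterministic DDIM inversion in a diffusion model's latent space.
# `eps_model` is a hypothetical stand-in for the Stable Diffusion U-Net
# noise predictor; the real one is also conditioned on a text embedding.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # assumed linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t


def sdedit_init(z0, t):
    """SDEdit initialization: perturb the clean latent z0 with random noise,
    z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(z0)
    return abar_t.sqrt() * z0 + (1.0 - abar_t).sqrt() * eps


@torch.no_grad()
def ddim_invert(z0, eps_model, num_steps=50, t_end=600):
    """Deterministic DDIM inversion: run the (eta = 0) DDIM update backwards
    so the resulting noisy latent denoises back to z0 under the same model."""
    timesteps = torch.linspace(0, t_end, num_steps + 1).long()
    z = z0
    for i in range(num_steps):
        t_cur, t_next = timesteps[i], timesteps[i + 1]
        abar_cur, abar_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = eps_model(z, t_cur)                                     # predicted noise
        z0_hat = (z - (1 - abar_cur).sqrt() * eps) / abar_cur.sqrt()  # predicted clean latent
        z = abar_next.sqrt() * z0_hat + (1 - abar_next).sqrt() * eps  # step to t_next
    return z
```

As described in the abstract, the features and correlations computed along this deterministic inversion trajectory serve as the “accurate” references used to correct the erroneous ones arising in the SDEdit-initialized generative pass.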

     

References
  • [1]
    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, vol. 27, 2014.
    [2]
    Y. Chen, Y. Lv, and F.-Y. Wang, “Traffic flow imputation using parallel data and generative adversarial networks,” IEEE Trans. Intelligent Transportation Systems, vol. 21, no. 4, pp. 1624–1630, 2019.
    [3]
    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 6840–6851.
    [4]
    P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” in Advances in Neural Information Processing Systems, 2021, pp. 8780–8794.
    [5]
    C. Wang, T. Chen, Z. Chen, Z. Huang, T. Jiang, Q. Wang, and H. Shan, “FLDM-VTON: Faithful latent diffusion model for virtual try-on,” in IJCAI, 2024.
    [6]
    Q. Gao, Z. Li, J. Zhang, Y. Zhang, and H. Shan, “CoreDiff: Contextual error-modulated generalized diffusion model for low-dose CT denoising and generalization,” IEEE Trans. Medical Imaging, vol. 43, no. 2, pp. 745–759, 2024. doi: 10.1109/TMI.2023.3320812
    [7]
    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proc. the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
    [8]
    A. Q. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,” in Int. Conf. on Machine Learning, 2022, pp. 16784–16804.
    [9]
    C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” in Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 36479–36494.
    [10]
    J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan et al., “Scaling autoregressive models for content-rich text-to-image generation,” Trans. Machine Learning Research, 2022.
    [11]
    A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in Int. Conf. on Machine Learning, 2021, pp. 8821–8831.
    [12]
    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,” arXiv preprint arXiv:2204.06125, 2022.
    [13]
    J. Betker, G. Goh, L. Jing et al., “Improving image generation with better captions,” 2023, https://cdn.openai.com/papers/dall-e-3.pdf.
    [14]
    K. Wang, C. Gou, N. Zheng, J. M. Rehg, and F.-Y. Wang, “Parallel vision for perception and understanding of complex scenes: methods, framework, and perspectives,” Artificial Intelligence Review, vol. 48, pp. 299–329, 2017. doi: 10.1007/s10462-017-9569-z
    [15]
    H. Zhang, G. Luo, Y. Li, and F.-Y. Wang, “Parallel vision for intelligent transportation systems in metaverse: Challenges, solutions, and potential applications,” IEEE Trans. Systems, Man, and Cybernetics: Systems, vol. 53, pp. 3400–3413, 2022.
    [16]
    H. Zhang, Y. Tian, K. Wang, W. Zhang, and F.-Y. Wang, “Mask SSD: An effective single-stage approach to object instance segmentation,” IEEE Trans. Image Processing, vol. 29, pp. 2078–2093, 2019.
    [17]
    B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, “Imagic: Text-based real image editing with diffusion models,” in Proc. the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2023, pp. 6007–6017.
    [18]
    T. Brooks, A. Holynski, and A. A. Efros, “InstructPix2Pix: Learning to follow image editing instructions,” in Proc. the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2023, pp. 18392–18402.
    [19]
    X. Li, J. Thickstun, I. Gulrajani, P. S. Liang, and T. B. Hashimoto, “Diffusion-LM improves controllable text generation,” Advances in Neural Information Processing Systems, vol. 35, pp. 4328–4343, 2022.
    [20]
    S. Ge, S. Nah, G. Liu, T. Poon, A. Tao, B. Catanzaro, D. Jacobs, J.-B. Huang, M.-Y. Liu, and Y. Balaji, “Preserve your own correlation: A noise prior for video diffusion models,” in Proc. the IEEE/CVF Int. Conf. on Computer Vision, 2023, pp. 22930–22941.
    [21]
    A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, “RePaint: Inpainting using denoising diffusion probabilistic models,” in Proc. the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2022, pp. 11461–11471.
    [22]
    L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proc. the IEEE/CVF Int. Conf. on Computer Vision, 2023, pp. 3836–3847.
    [23]
    J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, “Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation,” in Proc. the IEEE/CVF Int. Conf. on Computer Vision, 2023, pp. 7623–7633.
    [24]
    C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon, “SDEdit: Guided image synthesis and editing with stochastic differential equations,” in Int. Conf. on Learning Representations, 2021.
    [25]
    J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in Int. Conf. on Learning Representations, 2020.
    [26]
    S. Witteveen and M. Andrews, “Investigating prompt engineering in diffusion models,” arXiv preprint arXiv:2211.15462, 2022.
    [27]
    N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel, “Plug-and-play diffusion features for text-driven image-to-image translation,” in Proc. the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2023, pp. 1921–1930.
    [28]
    M. Kwon, J. Jeong, and Y. Uh, “Diffusion models already have a semantic latent space,” arXiv preprint arXiv:2210.10960, 2022.
    [29]
    R. Mokady, A. Hertz, K. Aberman, Y. Pritch, and D. Cohen-Or, “Null-text inversion for editing real images using guided diffusion models,” in Proc. the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2023, pp. 6038–6047.
    [30]
    D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.
    [31]
    D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in Int. Conf. on Machine Learning, 2015, pp. 1530–1538.
    [32]
    L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear independent components estimation,” arXiv preprint arXiv:1410.8516, 2014.
    [33]
    D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in Advances in Neural Information Processing Systems, vol. 31, 2018.
    [34]
    Z. Huang, S. Chen, J. Zhang, and H. Shan, “AgeFlow: Conditional age progression and regression with normalizing flows,” in IJCAI, 2021, pp. 743–750.
    [35]
    P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. the IEEE Conf. on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
    [36]
    J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. the IEEE Int. Conf. on Computer Vision, 2017, pp. 2223–2232.
    [37]
    Z. Yi, H. Zhang, P. Tan, and M. Gong, “DualGAN: Unsupervised dual learning for image-to-image translation,” in Proc. the IEEE Int. Conf. on Computer Vision, 2017, pp. 2849–2857.
    [38]
    Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proc. the IEEE Conf. on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797.
    [39]
    A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, “GANimation: Anatomically-aware facial animation from a single image,” in Proc. the European Conf. on Computer Vision, 2018, pp. 818–833.
    [40]
    Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “AttGAN: Facial attribute editing by only changing what you want,” IEEE Trans. Image Processing, vol. 28, no. 11, pp. 5464–5478, 2019. doi: 10.1109/TIP.2019.2916751
    [41]
    T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image synthesis with spatially-adaptive normalization,” in Proc. the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2019, pp. 2337–2346.
    [42]
    K. Zhang, Y. Su, X. Guo, L. Qi, and Z. Zhao, “MU-GAN: Facial attribute editing based on multi-attention mechanism,” IEEE/CAA Journal of Automatica Sinica, vol. 8, no. 9, pp. 1614–1626, 2021. doi: 10.1109/JAS.2020.1003390
    [43]
    M. Liu, Y. Ding, M. Xia, X. Liu, E. Ding, W. Zuo, and S. Wen, “STGAN: A unified selective transfer network for arbitrary image attribute editing,” in Proc. the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2019, pp. 3673–3682.
    [44]
    E. Richardson, Y. Alaluf, O. Patashnik, Y. Nitzan, Y. Azar, S. Shapiro, and D. Cohen-Or, “Encoding in style: A StyleGAN encoder for image-to-image translation,” in Proc. the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2021, pp. 2287–2296.
    [45]
    H. Lin, Y. Liu, S. Li, and X. Qu, “How generative adversarial networks promote the development of intelligent transportation systems: A survey,” IEEE/CAA Journal of Automatica Sinica, vol. 10, no. 9, pp. 1781–1796, 2023. doi: 10.1109/JAS.2023.123744
    [46]
    K. Wang, C. Gou, Y. Duan, Y. Lin, X. Zheng, and F.-Y. Wang, “Generative adversarial networks: introduction and outlook,” IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 4, pp. 588–598, 2017. doi: 10.1109/JAS.2017.7510583
    [47]
    Y. Yuan, S. Ma, and J. Zhang, “VR-FAM: Variance-reduced encoder with nonlinear transformation for facial attribute manipulation,” in ICASSP 2022-2022 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 1755–1759.
    [48]
    Y. Yuan, S. Ma, H. Shan, and J. Zhang, “DO-FAM: Disentangled nonlinear latent navigation for facial attribute manipulation,” in ICASSP 2023-2023 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
    [49]
    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Int. Conf. on Machine Learning, 2021, pp. 8748–8763.
    [50]
    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
    [51]
    G. Couairon, J. Verbeek, H. Schwenk, and M. Cord, “DiffEdit: Diffusion-based semantic image editing with mask guidance,” arXiv preprint arXiv:2210.11427, 2022.
    [52]
    B. Wallace, A. Gokul, and N. Naik, “EDICT: Exact diffusion inversion via coupled transformations,” in Proc. the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2023, pp. 22532–22541.
    [53]
    L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using Real NVP,” arXiv preprint arXiv:1605.08803, 2016.
    [54]
    A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-or, “Prompt-to-prompt image editing with cross-attention control,” in Int. Conf. on Learning Representations, 2022.
    [55]
    B. Liu, C. Wang, T. Cao, K. Jia, and J. Huang, “Towards understanding cross and self-attention in stable diffusion for text-guided image editing,” in Proc. the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2024, pp. 7817–7826.
    [56]
    N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation,” in Proc. the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2023, pp. 22500–22510.
    [57]
    R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” arXiv preprint arXiv:2208.01618, 2022.
    [58]
    H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, “IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models,” arXiv preprint arXiv:2308.06721, 2023.
    [59]
    A. Van Den Oord, O. Vinyals et al., “Neural discrete representation learning,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
    [60]
    D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo et al., “The many faces of robustness: A critical analysis of out-of-distribution generalization,” in Proc. the IEEE/CVF Int. Conf. on Computer Vision, 2021, pp. 8340–8349.
    [61]
    O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany. Springer, 2015, pp. 234–241.
    [62]
    A. Maćkiewicz and W. Ratajczak, “Principal components analysis (PCA),” Computers & Geosciences, vol. 19, no. 3, pp. 303–342, 1993.
    [63]
    C. Si, Z. Huang, Y. Jiang, and Z. Liu, “FreeU: Free lunch in diffusion U-Net,” arXiv preprint arXiv:2309.11497, 2023.
    [64]
    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. the European Conf. on Computer Vision, 2014, pp. 740–755.
    [65]
    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” Int. Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015. doi: 10.1007/s11263-015-0816-y
    [66]
    J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 6, pp. 679–698, 1986.
    [67]
    Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Asilomar Conf. on Signals, Systems, and Computers, vol. 2, 2003, pp. 1398–1402.
    [68]
    S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proc. the IEEE Int. Conf. on Computer Vision, 2015, pp. 1395–1403.
