A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical and experimental research and development in all areas of automation
Volume 9 Issue 8
Aug.  2022

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 7.847, Top 10% (SCI Q1)
  • CiteScore: 13.0, Top 5% (Q1)
  • Google Scholar h5-index: 64, Top 7
Y. Ming, N. N. Hu, C. X. Fan, F. Feng, J. W. Zhou, and H. Yu, "Visuals to text: A comprehensive review on automatic image captioning," IEEE/CAA J. Autom. Sinica, vol. 9, no. 8, pp. 1339–1365, Aug. 2022. doi: 10.1109/JAS.2022.105734

Visuals to Text: A Comprehensive Review on Automatic Image Captioning

doi: 10.1109/JAS.2022.105734
Funds: This work was supported by the Beijing Natural Science Foundation (L201023) and the National Natural Science Foundation of China (62076030).
  • Image captioning refers to the automatic generation of descriptive text from the visual content of an image. It is a technique integrating multiple disciplines, including computer vision (CV), natural language processing (NLP), and artificial intelligence. In recent years, substantial research effort has been devoted to generating image captions, with impressive progress. To summarize these advances, we present a comprehensive review of image captioning, covering both traditional methods and recent deep learning-based techniques. Specifically, we first briefly review early traditional works based on retrieval and templates. We then focus on deep learning-based image captioning research, which we categorize into the encoder-decoder framework, attention mechanisms, and training strategies, according to model structure and training manner. After that, we summarize the publicly available datasets, the standard evaluation metrics along with metrics proposed for specific requirements, and compare state-of-the-art methods on the MS COCO dataset. Finally, we discuss open challenges and future research directions.
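The encoder-decoder pipeline with attention that this review surveys can be illustrated compactly: an encoder produces a set of image region features, and at each decoding step the decoder attends over those regions to form a context vector before predicting the next word. The sketch below uses random NumPy weights in place of a trained CNN encoder and RNN decoder; all dimensions, parameter names, and the toy vocabulary are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, for illustration only).
num_regions, feat_dim = 4, 8    # encoder output: region features (e.g., from a CNN)
hidden_dim, vocab_size = 8, 6   # decoder state size and toy vocabulary size

vocab = ["<start>", "<end>", "a", "dog", "on", "grass"]

# Randomly initialised parameters stand in for trained weights.
W_att = rng.normal(size=(feat_dim, hidden_dim)) * 0.1    # projects region features
U_att = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # projects decoder state
v_att = rng.normal(size=hidden_dim) * 0.1                # attention scoring vector
W_h = rng.normal(size=(hidden_dim + feat_dim + vocab_size, hidden_dim)) * 0.1
W_out = rng.normal(size=(hidden_dim + feat_dim, vocab_size)) * 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(regions, h):
    """Additive (Bahdanau-style) attention over region features."""
    scores = np.tanh(regions @ W_att + h @ U_att) @ v_att  # (num_regions,)
    alpha = softmax(scores)                                # attention weights, sum to 1
    context = alpha @ regions                              # weighted sum of regions
    return context, alpha

def decode_step(regions, h, prev_token_id):
    """One decoder step: attend, update state, score the vocabulary."""
    one_hot = np.zeros(vocab_size)
    one_hot[prev_token_id] = 1.0
    context, alpha = attend(regions, h)
    h_new = np.tanh(np.concatenate([h, context, one_hot]) @ W_h)
    logits = np.concatenate([h_new, context]) @ W_out
    return h_new, logits, alpha

def greedy_caption(regions, max_len=5):
    """Greedy decoding: pick the highest-scoring word at each step."""
    h = np.zeros(hidden_dim)
    token = vocab.index("<start>")
    words = []
    for _ in range(max_len):
        h, logits, _ = decode_step(regions, h, token)
        token = int(np.argmax(logits))
        if vocab[token] == "<end>":
            break
        words.append(vocab[token])
    return words

regions = rng.normal(size=(num_regions, feat_dim))  # stand-in CNN region features
caption = greedy_caption(regions)
print(caption)
```

With trained weights, the same loop underlies most attention-based captioners; the review's later sections vary the encoder (CNN regions, object detections, scene graphs, Transformers), the attention form, and the training objective (cross-entropy vs. reinforcement-learning-based sequence training).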

     

  • 1 We use the Flickr8k dataset as the query corpus in this demo.
    2 https://cocodataset.org/
    3 https://competitions.codalab.org/competitions/3221
    4 https://forms.illinois.edu/sec/1713398
    5 http://shannon.cs.illinois.edu/DenotationGraph/
    6 https://www.imageclef.org/photodata
    7 http://visualgenome.org/
    8 https://github.com/furkanbiten/GoodNews
    9 https://challenger.ai/
    10 https://vizwiz.org/
    11 https://nocaps.org/
    12 https://github.com/tylin/coco-caption
  • [1]
    S. P. Manay, S. A. Yaligar, Y. Thathva Sri Sai Reddy, and N. J. Saunshimath, “Image captioning for the visually impaired,” in Proc. Emerging Research in Computing, Information, Communication and Applications, Springer, 2022, pp. 511−522.
    [2]
    R. Hinami, Y. Matsui, and S. Satoh, “Region-based image retrieval revisited,” in Proc. 25th ACM Int. Conf. Multimedia, 2017, pp. 528−536.
    [3]
    E. Hand and R. Chellappa, “Attributes for improved attributes: A multi-task network utilizing implicit and explicit relationships for facial attribute classification,” in Proc. AAAI Conf. Artificial Intelligence, 2017, vol. 31, no. 1, pp. 4068−4074.
    [4]
    X. Cheng, J. Lu, J. Feng, B. Yuan, and J. Zhou, “Scene recognition with objectness,” Pattern Recognition, vol. 74, pp. 474–487, 2018. doi: 10.1016/j.patcog.2017.09.025
    [5]
    Z. Meng, L. Yu, N. Zhang, T. L. Berg, B. Damavandi, V. Singh, and A. Bearman, “Connecting what to say with where to look by modeling human attention traces,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2021, pp. 12679−12688.
    [6]
    L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen, “Deep learning for generic object detection: A survey,” Int. J. Computer Vision, vol. 128, no. 2, pp. 261–318, 2020. doi: 10.1007/s11263-019-01247-4
    [7]
    D. Gurari, Y. Zhao, M. Zhang, and N. Bhattacharya, “Captioning images taken by people who are blind,” in Proc. European Conf. Computer Vision, Springer, 2020, pp. 417−434.
    [8]
    A. Kojima, T. Tamura, and K. Fukunaga, “Natural language description of human activities from video images based on concept hierarchy of actions,” Int. J. Computer Vision, vol. 50, no. 2, pp. 171–184, 2002. doi: 10.1023/A:1020346032608
    [9]
    P. Hède, P.-A. Moëllic, J. Bourgeoys, M. Joint, and C. Thomas, “Automatic generation of natural language description for images.” in Proc. RIAO, Citeseer, 2004, pp. 306−313.
    [10]
    S. Li, G. Kulkarni, T. Berg, A. Berg, and Y. Choi, “Composing simple image descriptions using web-scale n-grams,” in Proc. 15th Conf. Computational Natural Language Learning, 2011, pp. 220−228.
    [11]
    S. P. Liu, Y. T. Xian, H. F. Li, and Z. T. Yu, “Text detection in natural scene images using morphological component analysis and Laplacian dictionary,” IEEE/CAA J. Autom. Sinica, vol. 7, no. 1, pp. 214–222, Jan. 2020.
    [12]
    A. Tran, A. Mathews, and L. Xie, “Transform and tell: Entity-aware news image captioning,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2020, pp. 13035−13045.
    [13]
    P. Kuznetsova, V. Ordonez, T. L. Berg, and Y. Choi, “Treetalk: Composition and compression of trees for image descriptions,” Trans. Association for Computational Linguistics, vol. 2, pp. 351–362, 2014. doi: 10.1162/tacl_a_00188
    [14]
    M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” J. Artificial Intelligence Research, vol. 47, pp. 853–899, 2013. doi: 10.1613/jair.3994
    [15]
    G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, “Babytalk: Understanding and generating simple image descriptions,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2891–2903, 2013. doi: 10.1109/TPAMI.2012.162
    [16]
    M. Mitchell, J. Dodge, A. Goyal, K. Yamaguchi, K. Stratos, X. Han, A. Mensch, A.-d. Berg, T. Berg, and H. Daumé III, “Midge: Generating image descriptions from computer vision detec-tions,” in Proc. 13th Conf. European Chapter Association for Computational Linguistics, 2012, pp. 747−756.
    [17]
    Y. Yang, C. Teo, H. Daumé III, and Y. Aloimonos, “Corpus-guided sentence generation of natural images,” in Proc. Conf. Empirical Methods in Natural Language Processing, 2011, pp. 444−454.
    [18]
    W. N. H. W. Mohamed, M. N. M. Salleh, and A.-d. H. Omar, “A comparative study of reduced error pruning method in decision tree algorithms,” in Proc. IEEE Int. Conf. Control System, Computing and Engineering, 2012, pp. 392−397.
    [19]
    W. Liu, Z. Wang, Y. Yuan, N. Zeng, K. Hone, and X. Liu, “A novel sigmoid-function-based adaptive weighted particle swarm optimizer,” IEEE Trans. Cybernetics, vol. 51, no. 2, pp. 1085–1093, 2019. doi: 10.1109/TCYB.2019.2925015
    [20]
    S. Harford, F. Karim, and H. Darabi, “Generating adversarial samples on multivariate time series using variational autoencoders,” IEEE/CAA J. Autom. Sinica, vol. 8, no. 9, pp. 1523–1538, Sept. 2021.
    [21]
    M. S. Sarafraz and M. S. Tavazoei, “A unified optimization-based framework to adjust consensus convergence rate and optimize the network topology in uncertain multi-agent systems,” IEEE/CAA J. Autom. Sinica, vol. 8, no. 9, pp. 1539–1539, Sept. 2021.
    [22]
    Y. R. Wang, S. C. Gao, M. C. Zhou, and Y. Yu, “A multi-layered gravitational search algorithm for function optimization and real-world problems,” IEEE/CAA J. Autom. Sinica, vol. 8, no. 1, pp. 94–109, Jan. 2021.
    [23]
    K. H. Liu, Z. H. Ye, H. Y. Guo, D. P. Cao, L. Chen, and F.-Y. Wang, “FISS GAN: A generative adversarial network for foggy image semantic segmentation,” IEEE/CAA J. Autom. Sinica, vol. 8, no. 8, pp. 1428–1439, Aug. 2021.
    [24]
    Y. Ming, X. Meng, C. Fan, and H. Yu, “Deep learning for monocular depth estimation: A review,” Neurocomputing, vol. 438, no. 28, pp. 14–33, 2021.
    [25]
    K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. Int. Conf. Machine Learning, 2015, pp. 2048−2057.
    [26]
    O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: Lessons learned from the 2015 MS COCO image captioning challenge,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 652–663, 2016.
    [27]
    J. Lu, C. Xiong, D. Parikh, and R. Socher, “Knowing when to look: Adaptive attention via a visual sentinel for image captioning,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2017, pp. 375−383.
    [28]
    S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2017, pp. 7008−7024.
    [29]
    P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2018, pp. 6077−6086.
    [30]
    W. Liu, S. Chen, L. Guo, X. Zhu, and J. Liu, “CPTR: Full transformer network for image captioning,” arXiv preprint arXiv: 2101.10804, 2021.
    [31]
    R. Kiros, R. Salakhutdinov, and R. Zemel, “Multimodal neural language models,” in Proc. Int. Conf. Machine Learning, 2014, pp. 595−603.
    [32]
    A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2015, pp. 3128−3137.
    [33]
    L. Wu, M. Xu, J. Wang, and S. Perry, “Recall what you see continually using grid LSTM in image captioning,” IEEE Trans. Multimedia, vol. 22, no. 3, pp. 808–818, 2019.
    [34]
    K. Lin, Z. Gan, and L. Wang, “Augmented partial mutual learning with frame masking for video captioning,” in Proc. AAAI Conf. Artificial Intelligence, 2021, vol. 35, no. 3, pp. 2047−2055.
    [35]
    T. Yao, Y. Pan, Y. Li, and T. Mei, “Exploring visual relationship for image captioning,” in Proc. European Conf. Computer Vision, 2018, pp. 684−699.
    [36]
    L. Guo, J. Liu, J. Tang, J. Li, W. Luo, and H. Lu, “Aligning linguistic words and visual semantic units for image captioning,” in Proc. 27th ACM Int. Conf. Multimedia, 2019, pp. 765−773.
    [37]
    X. Yang, H. Zhang, and J. Cai, “Auto-encoding and distilling scene graphs for image captioning,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 44, no. 5, pp. 2313–2327, 2020. doi: 10.1109/TPAMI.2020.3042192
    [38]
    X. S. Li, Y. T. Liu, K. F. Wang, and F.-Y. Wang, “A recurrent attention and interaction model for pedestrian trajectory prediction,” IEEE/CAA J. Autom. Sinica, vol. 7, no. 5, pp. 1361–1370, Sept. 2020.
    [39]
    P. Liu, Y. Zhou, D. Peng, and D. Wu, “Global-attention-based neural networks for vision language intelligence,” IEEE/CAA J. Autom. Sinica, vol. 8, no. 7, pp. 1243–1252, 2020.
    [40]
    T. L. Zhou, M. Chen, and J. Zou, “Reinforcement learning based data fusion method for multi-sensors,” IEEE/CAA J. Autom. Sinica, vol. 7, no. 6, pp. 1489–1497, Nov. 2020.
    [41]
    P. H. Seo, P. Sharma, T. Levinboim, B. Han, and R. Soricut, “Reinforcing an image caption generator using off-line human feedback,” in Proc. AAAI Conf. Artificial Intelligence, 2020, vol. 34, no. 3, pp. 2693−2700.
    [42]
    J. Yu, J. Li, Z. Yu, and Q. Huang, “Multimodal transformer with multi-view visual representation for image captioning,” IEEE Trans. Circuits and Systems for Video Technology, vol. 30, no. 12, pp. 4467–4480, 2020. doi: 10.1109/TCSVT.2019.2947482
    [43]
    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv: 1810.04805, 2018.
    [44]
    P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao, “VINVL: Revisiting visual representations in vision-language models,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2021, pp. 5579−5588.
    [45]
    A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, “Every picture tells a story: Generating sentences from images,” in Proc. European Conf. Computer Vision, Springer, 2010, pp. 15−29.
    [46]
    V. Ordonez, G. Kulkarni, and T. Berg, “Im2text: Describing images using 1 million captioned photographs,” Advances in Neural Information Processing Systems, vol. 24, pp. 1143–1151, 2011.
    [47]
    R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, “Grounded compositional semantics for finding and describing images with sentences,” Trans. Association Computational Linguistics, vol. 2, pp. 207–218, 2014. doi: 10.1162/tacl_a_00177
    [48]
    R. Mason and E. Charniak, “Nonparametric method for data-driven image captioning,” in Proc. 52nd Annual Meeting Association for Computational Linguistics, 2014, vol. 2, pp. 592−598.
    [49]
    C. Sun, C. Gan, and R. Nevatia, “Automatic concept discovery from parallel text and visual corpora,” in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 2596−2604.
    [50]
    A. Gupta, Y. Verma, and C. Jawahar, “Choosing linguistics over vision to describe images,” in Proc. AAAI Conf. Artificial Intelligence, 2012, vol. 26, no. 1.
    [51]
    J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick, “Exploring nearest neighbor approaches for image captioning,” arXiv preprint arXiv: 1505.04467, 2015.
    [52]
    R. Xu, C. Xiong, W. Chen, and J. Corso, “Jointly modeling deep video and compositional text to bridge vision and language in a unified framework,” in Proc. AAAI Conf. Artificial Intelligence, 2015, vol. 29, no. 1.
    [53]
    R. Lebret, P. Pinheiro, and R. Collobert, “Phrase-based image captioning,” in Proc. Int. Conf. Machine Learning, 2015, pp. 2085−2094.
    [54]
    N. Krishnamoorthy, G. Malkarnenkar, R. Mooney, K. Saenko, and S. Guadarrama, “Generating natural-language video descriptions using text-mined knowledge,” in Proc. AAAI Conf. Artificial Intelligence, 2013, vol. 27, no. 1.
    [55]
    I. U. Rahman, Z. Wang, W. Liu, B. Ye, M. Zakarya, and X. Liu, “An n-state markovian jumping particle swarm optimization algorithm,” IEEE Trans. Systems,Man,and Cybernetics: Systems, vol. 51, no. 11, pp. 6626–6638, 2020. doi: 10.1109/TSMC.2019.2958550
    [56]
    Y. Ushiku, M. Yamaguchi, Y. Mukuta, and T. Harada, “Common subspace for model and similarity: Phrase learning for caption generation from images,” in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 2668−2676.
    [57]
    M. Muzahid, W. G. Wan, F. Sohel, L. Y. Wu, and L. Hou, “CurveNet: Curvature-based multitask learning deep networks for 3D object recognition,” IEEE/CAA J. Autom. Sinica, vol. 8, no. 6, pp. 1177–1187, Jun. 2021.
    [58]
    Q. Wu, C. Shen, L. Liu, A. Dick, and A. Van Den Hengel, “What value do explicit high level concepts have in vision to language problems?” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2016, pp. 203−212.
    [59]
    J. Gu, S. Joty, J. Cai, and G. Wang, “Unpaired image captioning by language pivoting,” in Proc. European Conf. Computer Vision, 2018, pp. 503−519.
    [60]
    J. Gamper and N. Rajpoot, “Multiple instance captioning: Learning representations from histo-pathology textbooks and articles,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2021, pp. 16549−16559.
    [61]
    Y. Song, S. Chen, Y. Zhao, and Q. Jin, “Unpaired cross-lingual image caption generation with self-supervised rewards,” in Proc. 27th ACM Int. Conf. Multi-Media, 2019, pp. 784−792.
    [62]
    J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2015, pp. 2625−2634.
    [63]
    X. Zhang, X. Sun, Y. Luo, J. Ji, Y. Zhou, Y. Wu, F. Huang, and R. Ji, “Rstnet: Captioning with adaptive attention on visual and non-visual words,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2021, pp. 15465−15474.
    [64]
    J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy, “Generation and comprehension of unambiguous object descriptions,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2016, pp. 11−20.
    [65]
    H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X.-D. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig, “From captions to visual concepts and back,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2015, pp. 1473−1482.
    [66]
    P. Anderson, S. Gould, and M. Johnson, “Partially-supervised image captioning,” arXiv preprint arXiv: 1806.06004, 2018.
    [67]
    R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 1440−1448.
    [68]
    S. Datta, K. Sikka, A. Roy, K. Ahuja, D. Parikh, and A. Divakaran, “Align2Ground: Weakly supervised phrase grounding guided by image-caption alignment,” in Proc. IEEE/CVF Int. Conf. Computer Vision, 2019, pp. 2601−2610.
    [69]
    X. Chen, M. Zhang, Z. Wang, L. Zuo, B. Li, and Y. Yang, “Leveraging unpaired out-of-domain data for image captioning,” Pattern Recognition Letters, vol. 132, pp. 132–140, 2020. doi: 10.1016/j.patrec.2018.12.018
    [70]
    D.-J. Kim, T.-H. Oh, J. Choi, and I. S. Kweon, “Dense relational image captioning via multi-task triple-stream networks,” arXiv preprint arXiv: 2010.03855, 2020.
    [71]
    L.-C. Yang, C.-Y. Yang, and J. Y.-j. Hsu, “Object relation attention for image paragraph captioning,” in Proc. AAAI Conf. Artificial Intelligence, vol. 35, no. 4, 2021, pp. 3136−3144.
    [72]
    X. B. Hong, T. Zhang, Z. Cui, and J. Yang, “Variational gridded graph convolution network for node classification,” IEEE/CAA J. Autom.Sinica, vol. 8, no. 10, pp. 1697–1708, Oct. 2021.
    [73]
    X. Liu, M. Yan, L. Deng, G. Li, X. Ye, and D. Fan, “Sampling methods for efficient training of graph convolutional networks: A survey,” IEEE/CAA J. Autom. Sinica, vol. 2, no. 9, pp. 205–234, 2022.
    [74]
    X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2019, pp. 10685−10694.
    [75]
    X. Yang, H. Zhang, and J. Cai, “Learning to collocate neural modules for image captioning,” in Proc. IEEE/CVF Int. Conf. Computer Vision, 2019, pp. 4250−4260.
    [76]
    J. Gu, S. Joty, J. Cai, H. Zhao, X. Yang, and G. Wang, “Unpaired image captioning via scene graph alignments,” in Proc. IEEE/CVF Int. Conf. Computer Vision, 2019, pp. 10323−10332.
    [77]
    R. Zellers, M. Yatskar, S. Thomson, and Y. Choi, “Neural motifs: Scene graph parsing with global context,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018, pp. 5831−5840.
    [78]
    V. S. Chen, P. Varma, R. Krishna, M. Bernstein, C. Re, and L. Fei-Fei, “Scene graph prediction with limited labels,” in Proc. IEEE/CVF Int. Conf. Computer Vision, 2019, pp. 2580−2590.
    [79]
    Y. Zhong, L. Wang, J. Chen, D. Yu, and Y. Li, “Comprehensive image captioning via scene graph decomposition,” in Proc. European Conf. Computer Vision, Springer, 2020, pp. 211−229.
    [80]
    S. Chen, Q. Jin, P. Wang, and Q. Wu, “Say as you wish: Fine-grained control of image caption generation with abstract scene graphs,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2020, pp. 9962−9971.
    [81]
    W. Zhang, H. Shi, S. Tang, J. Xiao, Q. Yu, and Y. Zhuang, “Consensus graph representation learning for better grounded image captioning,” in Proc. AAAI Conf. Artificial Intelligence, 2021, pp.3394–3402.
    [82]
    S. Tripathi, K. Nguyen, T. Guha, B. Du, and T. Q. Nguyen, “Sg2caps: Revisiting scene graphs for image captioning,” arXiv preprint arXiv: 2102.04990, 2021.
    [83]
    D. Wang, D. Beck, and T. Cohn, “On the role of scene graphs in image captioning,” in Proc. Beyond Vision and Language: Integrating Real-World Knowledge, 2019, pp. 29−34.
    [84]
    V. S. J. Milewski, M. F. Moens, and I. Calixto, “Are scene graphs good enough to improve image captioning?” in Proc. 1st Conf. Asia-Pacific Chapter of Association for Computational Linguistics and 10th Int. Joint Conf. Natural Language Processing, 2020, pp. 504−515.
    [85]
    J. Mao, X. Wei, Y. Yang, J. Wang, Z. Huang, and A. L. Yuille, “Learning like a child: Fast novel visual concept learning from sentence descriptions of images,” in Proc. IEEE/CVF Int. Conf. Computer Vision, 2015, pp. 2533−2541.
    [86]
    L. Wang, A. G. Schwing, and S. Lazebnik, “Diverse and accurate image description using a variational auto-encoder with an additive gaussian encoding space,” arXiv preprint arXiv: 1711.07068, 2017.
    [87]
    M. Wang, L. Song, X. Yang, and C. Luo, “A parallel-fusion RNN-LSTM architecture for image caption generation,” in IEEE Int. Conf. Image Processing, IEEE, 2016, pp. 4448−4452.
    [88]
    W. Jiang, L. Ma, X. Chen, H. Zhang, and W. Liu, “Learning to guide decoding for image captioning,” in Proc. AAAI Conf. Artificial Intelligence, 2018, pp. 6959−6966.
    [89]
    Y. Xian and Y. Tian, “Self-guiding multimodal LSTM—When we do not have a perfect training dataset for image captioning?” IEEE Trans. Image Processing, vol. 28, no. 11, pp. 5241–5252, 2019. doi: 10.1109/TIP.2019.2917229
    [90]
    X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars, “Guiding the long-short term memory model for image caption generation,” in Proc. IEEE/CVF Int. Conf. Computer Vision, 2015, pp. 2407−2415.
    [91]
    J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, “Deep captioning with multimodal recurrent neural networks (m-RNN),” arXiv preprint arXiv: 1412.6632, 2014.
    [92]
    C. Wang, H. Yang, C. Bartz, and C. Meinel, “Image captioning with deep bidirectional LSTMs,” in Proc. 24th ACM International Conf. Multimedia, 2016, pp. 988−997.
    [93]
    Y. Zheng, Y. Li, and S. Wang, “Intention oriented image captions with guiding objects,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2019, pp. 8395−8404.
    [94]
    I. Laina, C. Rupprecht, and N. Navab, “Towards unsupervised image captioning with shared multimodal embeddings,” in Proc. IEEE/CVF Int. Conf. Computer Vision, 2019, pp. 7414−7424.
    [95]
    W. J. Zhang, J. C. Wang, and F. P. Lan, “Dynamic hand gesture recognition based on short-term sampling neural networks,” IEEE/CAA J. Autom. Sinica, vol. 8, no. 1, pp. 110–120, Jan. 2021.
    [96]
    L. Liu, J. Tang, X. Wan, and Z. Guo, “Generating diverse and descriptive image captions using visual paraphrases,” in Proc. IEEE/CVF Int. Conf. Computer Vision, 2019, pp. 4240−4249.
    [97]
    G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang, and J. Shao, “Context and attribute grounded dense captioning,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2019, pp. 6241−6250.
    [98]
    J. Gu, J. Cai, G. Wang, and T. Chen, “Stack-captioning: Coarse-to-fine learning for image captioning,” in Proc. AAAI Conf. Artificial Intelligence, 2018, vol. 32, no. 1, pp. 6837–6844.
    [99]
    D.-J. Kim, J. Choi, T.-H. Oh, and I. S. Kweon, “Dense relational captioning: Triple-stream networks for relationship-based captioning,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2019, pp. 6271−6280.
    [100]
    Z. Song, X. Zhou, Z. Mao, and J. Tan, “Image captioning with context-aware auxiliary guidance,” in Proc. AAAI Conf. Artificial Intelligence, 2021, vol. 35, no. 3, pp. 2584−2592.
    [101]
    X. D. Zhao, Y. R. Chen, J. Guo, and D. B. Zhao, “A spatial-temporal attention model for human trajectory prediction,” IEEE/CAA J. Autom. Sinica, vol. 7, no. 4, pp. 965–974, Jul. 2020.
    [102]
    L. Li, S. Tang, L. Deng, Y. Zhang, and Q. Tian, “Image caption with global-local attention,” in Proc. AAAI Conf. Artificial Intelligence, 2017, vol. 31, no. 1, pp. 4133−4239.
    [103]
    C. Wu, Y. Wei, X. Chu, F. Su, and L. Wang, “Modeling visual and word-conditional semantic attention for image captioning,” Signal Processing: Image Communication, vol. 67, pp. 100–107, 2018. doi: 10.1016/j.image.2018.06.002
    [104]
    Z. Zhang, Q. Wu, Y. Wang, and F. Chen, “Fine-grained and semantic-guided visual attention for image captioning,” in Proc. IEEE Winter Conf. Applications of Computer Vision, IEEE, 2018, pp. 1709−1717.
    [105]
    P. Cao, Z. Yang, L. Sun, Y. Liang, M. Q. Yang, and R. Guan, “Image captioning with bidirectional semantic attention-based guiding of long short-term memory,” Neural Processing Letters, vol. 50, no. 1, pp. 103–119, 2019. doi: 10.1007/s11063-018-09973-5
    [106]
    S. Wang, L. Lan, X. Zhang, G. Dong, and Z. Luo, “Object-aware semantics of attention for image captioning,” Multimedia Tools and Applications, vol. 79, no. 3, pp. 2013–2030, 2020.
    [107]
    L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, “SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2017, pp. 5659−5667.
    [108]
    J. Zhou, X. Wang, J. Han, S. Hu, and H. Gao, “Spatial-temporal attention for image captioning,” in Proc. IEEE Fourth Int. Conf. Multimedia Big Data, 2018, pp. 1−5.
    [109]
    J. Ji, C. Xu, X. Zhang, B. Wang, and X. Song, “Spatio-temporal memory attention for image captioning,” IEEE Trans. Image Processing, vol. 29, pp. 7615–7628, 2020. doi: 10.1109/TIP.2020.3004729
    [110]
    J. Lu, J. Yang, D. Batra, and D. Parikh, “Neural baby talk,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2018, pp. 7219−7228.
    [111]
    F. Xiao, X. Gong, Y. Zhang, Y. Shen, J. Li, and X. Gao, “DAA: Dual LSTMs with adaptive attention for image captioning,” Neurocomputing, vol. 364, pp. 322–329, 2019. doi: 10.1016/j.neucom.2019.06.085
    [112]
    Z. Deng, Z. Jiang, R. Lan, W. Huang, and X. Luo, “Image captioning using DenseNet network and adaptive attention,” Signal Processing: Image Communication, vol. 85, p. 115836, 2020.
    [113]
    C. Yan, Y. Hao, L. Li, J. Yin, A. Liu, Z. Mao, Z. Chen, and X. Gao, “Task-adaptive attention for image captioning,” IEEE Trans. Circuits and Systems for Video Technology, vol. 32, no. 1, pp. 43–51, 2021. doi: 10.1109/TCSVT.2021.3067449
    [114]
    M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, “Paying more attention to saliency: Image captioning with saliency and context attention,” ACM Trans. Multimedia Computing,Communications,and Applications, vol. 14, no. 2, pp. 1–21, 2018.
    [115]
    J. Wang, W. Wang, L. Wang, Z. Wang, D. D. Feng, and T. Tan, “Learning visual relationship and context-aware attention for image captioning,” Pattern Recognition, vol. 98, p. 107075, 2020.
    [116]
    H. Chen, G. Ding, Z. Lin, Y. Guo, and J. Han, “Attend to knowledge: Memory-enhanced attention network for image captioning,” in Proc. Int. Conf. Brain Inspired Cognitive Systems, Springer, 2018, pp. 161−171.
    [117]
    T. Wang, X. Xu, F. Shen, and Y. Yang, “A cognitive memory-augmented network for visual anomaly detection,” IEEE/CAA J. Autom. Sinica, vol. 8, no. 7, pp. 1296–1307, Jul. 2021.
    [118]
    C. Xu, M. Yang, X. Ao, Y. Shen, R. Xu, and J. Tian, “Retrieval-enhanced adversarial training with dynamic memory-augmented attention for image paragraph captioning,” Knowledge-Based Systems, vol. 214, p. 106730, 2021.
    [119]
    Y. Cheng, F. Huang, L. Zhou, C. Jin, Y. Zhang, and T. Zhang, “A hierarchical multimodal attention-based neural network for image captioning,” in Proc. 40th Int. ACM SIGIR Conf. Research and Development in Information Retrieval, 2017, pp. 889−892.
    [120]
    Q. Wang and A. B. Chan, “Gated hierarchical attention for image captioning,” in Proc. Asian Conf. Computer Vision, Springer, 2018, pp. 21−37.
    [121]
    W. Wang, Z. Chen, and H. Hu, “Hierarchical attention network for image captioning,” in Proc. AAAI Conf. Artificial Intelligence, 2019, vol. 33, no. 1, pp. 8957−8964.
    [122]
    S. Yan, Y. Xie, F. Wu, J. S. Smith, W. Lu, and B. Zhang, “Image captioning via hierarchical attention mechanism and policy gradient optimization,” Signal Processing, vol. 167, p. 107329, 2020.
    [123]
    L. Gao, K. Fan, J. Song, X. Liu, X. Xu, and H. T. Shen, “Deliberate attention networks for image captioning,” in Proc. AAAI Conf. Artificial Intelligence, 2019, vol. 33, no. 1, pp. 8320−8327.
    [124]
    Z. Zhang, Y. Wang, Q. Wu, and F. Chen, “Visual relationship attention for image captioning,” in Proc. IEEE Int. Joint Conf. Neural Networks, 2019, pp. 1−8.
    [125]
    Z. Zhang, Q. Wu, Y. Wang, and F. Chen, “Exploring region relationships implicitly: Image captioning with visual relationship attention,” Image and Vision Computing, vol. 109, p. 104146, 2021.
    [126]
    R. Del Chiaro, B. Twardowski, A. D. Bagdanov, and J. Van de Weijer, “Ratt: Recurrent attention to transient tasks for continual image captioning,” arXiv preprint arXiv: 2007.06271, 2020.
    [127]
    Y. Li, X. Zhang, J. Gu, C. Li, X. Wang, X. Tang, and L. Jiao, “Recurrent attention and semantic gate for remote sensing image captioning,” IEEE Trans. Geoscience and Remote Sensing, vol.60, p. 5608816, 2021.
    [128]
    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Advances in Neural Information Processing Systems, 2017, pp. 5998−6008.
    [129]
    Y. Pan, T. Yao, Y. Li, and T. Mei, “X-linear attention networks for image captioning,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2020, pp. 10971−10980.
    [130]
    J. Banzi, I. Bulugu, and Z. F. Ye, “Learning a deep predictive coding network for a semi-supervised 3D-hand pose estimation,” IEEE/CAA J. Autom. Sinica, vol. 7, no. 5, pp. 1371–1379, Sept. 2020.
    [131]
    X. Zhu, L. Li, J. Liu, H. Peng, and X. Niu, “Captioning transformer with stacked attention modules,” Applied Sciences, vol. 8, no. 5, p. 739, 2018.
    [132]
    S. Herdade, A. Kappeler, K. Boakye, and J. Soares, “Image captioning: Transforming objects into words,” in Proc. Advances in Neural Information Processing Systems, 2019, pp. 11137−11147.
    [133]
    L. Guo, J. Liu, X. Zhu, P. Yao, S. Lu, and H. Lu, “Normalized and geometry-aware self-attention network for image captioning,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2020, pp. 10327−10336.
    [134]
    G. Li, L. Zhu, P. Liu, and Y. Yang, “Entangled transformer for image captioning,” in Proc. IEEE Int. Conf. Computer Vision, 2019, pp. 8928−8937.
    [135]
    M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, “Meshed-memory transformer for image captioning,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2020, pp. 10578−10587.
    [136]
    Y. Luo, J. Ji, X. Sun, L. Cao, Y. Wu, F. Huang, C. Lin, and R. Ji, “Dual-level collaborative transformer for image captioning,” in Proc. AAAI Conf. Artificial Intelligence, 2021, pp. 1−8.
    [137]
    C. Sundaramoorthy, L. Z. Kelvin, M. Sarin, and S. Gupta, “End-to-end attention-based image captioning,” arXiv preprint arXiv: 2104.14721, 2021.
    [138]
    N. Zeng, H. Li, Z. Wang, W. Liu, S. Liu, F. E. Alsaadi, and X. Liu, “Deep-reinforcement-learning-based images segmentation for quantitative analysis of gold immunochromatographic strip,” Neurocomputing, vol. 425, pp. 173–180, 2021. doi: 10.1016/j.neucom.2020.04.001
    [139]
    J. Ji, X. Sun, Y. Zhou, R. Ji, F. Chen, J. Liu, and Q. Tian, “Attacking image captioning towards accuracy-preserving target words removal,” in Proc. 28th ACM Int. Conf. Multimedia, 2020, pp. 4226−4234.
    [140]
    T. Liu, B. Tian, Y. F. Ai, and F.-Y. Wang, “Parallel reinforcement learning-based energy efficiency improvement for a cyber-physical system,” IEEE/CAA J. Autom. Sinica, vol. 7, no. 2, pp. 617–626, Mar. 2020.
    [141]
    M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, “Sequence level training with recurrent neural networks,” arXiv preprint arXiv: 1511.06732, 2015.
    [142]
    R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2015, pp. 4566−4575.
    [143]
    P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic propositional image caption evaluation,” in Proc. European Conf. Computer Vision, Springer, 2016, pp. 382−398.
    [144]
    S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved image captioning via policy gradient optimization of SPIDEr,” in Proc. IEEE Int. Conf. Computer Vision, 2017, pp. 873−881.
    [145]
    L. Zhang, F. Sung, L. Feng, T. Xiang, S. Gong, Y. Yang, and T. Hospedales, “Actor-critic sequence training for image captioning,” in Visually-Grounded Interaction and Language: NIPS 2017 Workshop, 2017, pp.1–10.
    [146]
    Y. Lin, J. McPhee, and N. L. Azad, “Comparison of deep reinforcement learning and model predictive control for adaptive cruise control,” IEEE Trans. Intelligent Vehicles, vol. 6, no. 2, pp. 221–231, Jun. 2021. doi: 10.1109/TIV.2020.3012947
    [147]
    C. Chen, S. Mu, W. Xiao, Z. Ye, L. Wu, and Q. Ju, “Improving image captioning with conditional generative adversarial nets,” in Proc. AAAI Conf. Artificial Intelligence, 2019, vol. 33, no. 1, pp. 8142−8150.
    [148]
    X. Shi, X. Yang, J. Gu, S. Joty, and J. Cai, “Finding it at another side: A viewpoint-adapted matching encoder for change captioning,” in Proc. European Conf. Computer Vision, Springer, 2020, pp. 574−590.
    [149]
    L. Zhou, H. Palangi, L. Zhang, H. Hu, J. Corso, and J. Gao, “Unified vision-language pre-training for image captioning and VQA,” in Proc. AAAI Conf. Artificial Intelligence, 2020, vol. 34, no. 7, pp. 13041−13049.
    [150]
    X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al., “Oscar: Object-semantics aligned pre-training for vision-language tasks,” in Proc. European Conf. Computer Vision, Springer, 2020, pp. 121−137.
    [151]
    X. Hu, X. Yin, K. Lin, L. Zhang, J. Gao, L. Wang, and Z. Liu, “Vivo: Visual vocabulary pre-training for novel object captioning,” in Proc. AAAI Conf. Artificial Intelligence, 2021, vol. 35, no. 2, pp. 1575−1583.
    [152]
    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. European Conf. Computer Vision, Springer, 2014, pp. 740−755.
    [153]
    P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Trans. Association Computational Linguistics, vol. 2, pp. 67–78, 2014. doi: 10.1162/tacl_a_00166
    [154]
    B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 2641−2649.
    [155]
    M. Everingham, A. Zisserman, C. K. Williams, et al., “The 2005 PASCAL visual object classes challenge,” in Proc. Machine Learning Challenges Workshop, Springer, 2005, pp. 117−176.
    [156]
    B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “YFCC100m: The new data in multimedia research,” Commun. ACM, vol. 59, no. 2, pp. 64–73, 2016. doi: 10.1145/2812802
    [157]
    D. Elliott, S. Frank, K. Sima’an, and L. Specia, “Multi30k: Multilingual English-German image descriptions,” in Proc. 5th Workshop on Vision and Language, 2016, pp. 70−74.
    [158]
    J. Wu, H. Zheng, B. Zhao, Y. Li, B. Yan, R. Liang, W. Wang, S. Zhou, G. Lin, Y. Fu, Y. Z. Wang, and Y. G. Wang, “AI challenger: A large-scale dataset for going deeper in image understanding,” arXiv preprint arXiv: 1711.06475, 2017.
    [159]
    M. Grubinger, P. Clough, H. Müller, and T. Deselaers, “The IAPR TC-12 benchmark: A new evaluation resource for visual information systems,” in Proc. Int. Workshop OntoImage, 2006, vol. 2, pp. 13–23.
    [160]
    A. F. Biten, L. Gomez, M. Rusinol, and D. Karatzas, “Good news, everyone! Context driven entity-aware captioning for news images,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2019, pp. 12466−12475.
    [161]
    H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson, “NOCAPS: Novel object captioning at scale,” in Proc. IEEE/CVF Int. Conf. Computer Vision, 2019, pp. 8948−8957.
    [162]
    X. Yang, H. Zhang, D. Jin, Y. Liu, C.-H. Wu, J. Tan, D. Xie, J. Wang, and X. Wang, “Fashion captioning: Towards generating accurate descriptions with semantic rewards,” in Proc. Computer Vision-ECCV, Springer, 2020, pp. 1−17.
    [163]
    O. Sidorov, R. Hu, M. Rohrbach, and A. Singh, “TextCaps: A dataset for image captioning with reading comprehension,” in Proc. European Conf. Computer Vision, Springer, 2020, pp. 742−758.
    [164]
    R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, S. B. Michael, and F.-F. Li, “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” Int. J. Computer Vision, vol. 123, no. 1, pp. 32–73, 2017. doi: 10.1007/s11263-016-0981-7
    [165]
    I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, A. Veit, et al., “OpenImages: A public dataset for large-scale multi-label and multi-class image classification,” Dataset available from https://github.com/openimages, 2017.
    [166]
    K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proc. 40th Annual Meeting Association for Computational Linguistics, 2002, pp. 311−318.
    [167]
    S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proc. ACL Workshop Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, vol. 29, pp. 65−72.
    [168]
    C. Lin and F. J. Och, “Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics,” in Proc. 42nd Annual Meeting of Association for Computational Linguistics, 2004, pp. 605−612.
    [169]
    A. Deshpande, J. Aneja, L. Wang, A. G. Schwing, and D. Forsyth, “Fast, diverse and accurate image captioning guided by part-of-speech,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2019, pp. 10695−10704.
    [170]
    J. Aneja, H. Agrawal, D. Batra, and A. Schwing, “Sequential latent spaces for modeling the intention during diverse image captioning,” in Proc. IEEE/CVF Int. Conf. Computer Vision, 2019, pp. 4261−4270.
    [171]
    Q. Wang and A. B. Chan, “Describing like humans: On diversity in image captioning,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2019, pp. 4195−4203.
    [172]
    Q. Wang, J. Wan, and A. B. Chan, “On diversity in image captioning: Metrics and methods,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 44, no. 2, pp. 1035–1049, 2020. doi: 10.1109/TPAMI.2020.3013834
    [173]
    M. Jiang, Q. Huang, L. Zhang, X. Wang, P. Zhang, Z. Gan, J. Diesner, and J. Gao, “Tiger: Text-to-image grounding for image caption evaluation,” in Proc. Conf. Empirical Methods in Natural Language Processing and 9th Int. Joint Conf. Natural Language Processing, 2020, pp. 2141−2152.
    [174]
    S. Wang, Z. Yao, R. Wang, Z. Wu, and X. Chen, “Faier: Fidelity and adequacy ensured image caption evaluation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2021, pp. 14050−14059.
    [175]
    T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” in Proc. Int. Conf. Learning Representations, 2019, pp. 1−43.
    [176]
    H. Lee, S. Yoon, F. Dernoncourt, D. S. Kim, T. Bui, and K. Jung, “ViLBERTScore: Evaluating image caption using vision-and-language BERT,” in Proc. 1st Workshop Evaluation and Comparison NLP Systems, 2020, pp. 34−39.
    [177]
    J. Wang, W. Xu, Q. Wang, and A. B. Chan, “Compare and reweight: Distinctive image captioning using similar images sets,” in Proc. European Conf. Computer Vision, Springer, 2020, pp. 370−386.
    [178]
    J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y. Choi, “CLIPScore: A reference-free evaluation metric for image captioning,” arXiv preprint arXiv: 2104.08718, 2021.
    [179]
    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. Machine Learning, 2021, pp. 8748−8763.
    [180]
    J. Feinglass and Y. Yang, “SMURF: Semantic and linguistic understanding fusion for caption evaluation via typicality analysis,” arXiv preprint arXiv: 2106.01444, 2021.
    [181]
    H. Lee, S. Yoon, F. Dernoncourt, T. Bui, and K. Jung, “UMIC: An unreferenced metric for image captioning via contrastive learning,” in Proc. 59th Annual Meeting of the Association for Computational Linguistics and 11th Int. Joint Conf. Natural Language Processing, 2021, pp. 220−226.
    [182]
    J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, “Explain images with multimodal recurrent neural networks,” arXiv preprint arXiv: 1410.1090, 2014.
    [183]
    A. Karpathy, A. Joulin, and L. Fei-Fei, “Deep fragment embeddings for bidirectional image sentence mapping,” in Proc. 27th Int. Conf. Neural Information Processing Systems, 2014, vol. 2, pp. 1889−1897.
    [184]
    X. Chen and C. L. Zitnick, “Mind’s eye: A recurrent visual representation for image caption generation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2015, pp. 2422−2431.
    [185]
    O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2015, pp. 3156−3164.
    [186]
    K. Tran, X. He, L. Zhang, J. Sun, C. Carapcea, C. Thrasher, C. Buehler, and C. Sienkiewicz, “Rich image captioning in the wild,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2016, pp. 49−56.
    [187]
    L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell, “Deep compositional captioning: Describing novel object categories without paired training data,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2016, pp. 1−10.
    [188]
    L. Yang, K. Tang, J. Yang, and L.-J. Li, “Dense captioning with joint inference and visual context,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2017, pp. 2193−2202.
    [189]
    J. Gu, G. Wang, J. Cai, and T. Chen, “An empirical study of language CNN for image captioning,” in Proc. IEEE Int. Conf. Computer Vision, 2017, pp. 1222−1231.
    [190]
    Q. Wu, C. Shen, P. Wang, A. Dick, and A. van den Hengel, “Image captioning and visual question answering based on attributes and external knowledge,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1367–1381, 2018.
    [191]
    J. Aneja, A. Deshpande, and A. G. Schwing, “Convolutional image captioning,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2018, pp. 5561−5570.
    [192]
    Q. Wang and A. B. Chan, “CNN + CNN: Convolutional decoders for image captioning,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2018, pp. 1−9.
    [193]
    X. Xiao, L. Wang, K. Ding, S. Xiang, and C. Pan, “Deep hierarchical encoder-decoder network for image captioning,” IEEE Trans. Multimedia, vol. 21, no. 11, pp. 2942–2956, 2019. doi: 10.1109/TMM.2019.2915033
    [194]
    Y. Qin, J. Du, Y. Zhang, and H. Lu, “Look back and predict forward in image captioning,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2019, pp. 8367−8375.
    [195]
    J. Liu, K. Wang, C. Xu, Z. Zhao, R. Xu, Y. Shen, and M. Yang, “Interactive dual generative adversarial networks for image captioning,” in Proc. AAAI Conf. Artificial Intelligence, 2020, vol. 34, no. 7, pp. 11588−11595.
    [196]
    Y. Wang, W. Zhang, Q. Liu, Z. Zhang, X. Gao, and X. Sun, “Improving intra-and inter-modality visual relation for image captioning,” in Proc. 28th ACM Int. Conf. Multimedia, 2020, pp. 4190−4198.
    [197]
    A. Hu, S. Chen, and Q. Jin, “ICECAP: Information concentrated entity-aware image captioning,” in Proc. 28th ACM Int. Conf. Multimedia, 2020, pp. 4217−4225.
    [198]
    Z. Fei, “Memory-augmented image captioning,” in Proc. AAAI Conf. Artificial Intelligence, 2021, vol. 35, no. 2, pp. 1317−1324.
    [199]
    Y. Zhang, X. Shi, S. Mi, and X. Yang, “Image captioning with transformer and knowledge graph,” Pattern Recognition Letters, vol. 143, pp. 43–49, 2021. doi: 10.1016/j.patrec.2020.12.020
    [200]
    Y. Luo, J. Ji, X. Sun, L. Cao, Y. Wu, F. Huang, C.-W. Lin, and R. Ji, “Dual-level collaborative transformer for image captioning,” in Proc. AAAI Conf. Artificial Intelligence, 2021, vol. 35, no. 3, pp. 2286−2293.
    [201]
    Z. Shi, X. Zhou, X. Qiu, and X. Zhu, “Improving image captioning with better use of caption,” in Proc. 58th Annual Meeting Association for Computational Linguistics, 2020, pp. 7454−7464.
    [202]
    L. Huang, W. Wang, J. Chen, and X. Wei, “Attention on attention for image captioning,” in Proc. IEEE Int. Conf. Computer Vision, 2019, pp. 4634−4643.
    [203]
    D. Liu, Z.-J. Zha, H. Zhang, Y. Zhang, and F. Wu, “Context-aware visual policy network for sequence-level image captioning,” in Proc. 26th ACM Int. Conf. Multimedia, 2018, pp. 1416−1424.
    [204]
    W. Jiang, L. Ma, Y.-G. Jiang, W. Liu, and T. Zhang, “Recurrent fusion network for image captioning,” in Proc. European Conf. Computer Vision, Springer, 2018, pp. 499−515.
    [205]
    T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, “Boosting image captioning with attributes,” in Proc. IEEE/CVF Int. Conf. Computer Vision, 2017, pp. 4894−4902.
    [206]
    T. Yao, Y. Pan, Y. Li, and T. Mei, “Hierarchy parsing for image captioning,” in Proc. IEEE/CVF Int. Conf. Computer Vision, 2019, pp. 2621−2629.
    [207]
    Y. Zhou, M. Wang, D. Liu, Z. Hu, and H. Zhang, “More grounded image captioning by distilling image-text matching model,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2020, pp. 4777−4786.
    [208]
    L. Wang, Z. Bai, Y. Zhang, and H. Lu, “Show, recall, and tell: Image captioning with recall mechanism,” in Proc. AAAI Conf. Artificial Intelligence, 2020, pp. 12176−12183.
    [209]
    J. Ji, Y. Luo, X. Sun, F. Chen, G. Luo, Y. Wu, Y. Gao, and R. Ji, “Improving image captioning by leveraging intra-and inter-layer global representation in transformer network,” in Proc. AAAI Conf. Artificial Intelligence, 2021, vol. 35, no. 2, pp. 1655−1663.
    [210]
    M. Cornia, L. Baraldi, and R. Cucchiara, “Show, control and tell: A framework for generating controllable and grounded captions,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2019, pp. 8307−8316.
    [211]
    C. Deng, N. Ding, M. Tan, and Q. Wu, “Length-controllable image captioning,” in Computer Vision-ECCV, Springer, 2020, pp. 712−729.
    [212]
    F. Sammani and L. Melas-Kyriazi, “Show, edit and tell: A framework for editing image captions,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2020, pp. 4808−4816.
    [213]
    X. Li, S. Jiang, and J. Han, “Learning object context for dense captioning,” in Proc. AAAI Conf. Artificial Intelligence, 2019, pp. 8650−8657.
    [214]
    S. Chen and Y.-G. Jiang, “Towards bridging event captioner and sentence localizer for weakly supervised dense event captioning,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2021, pp. 8425−8435.
    [215]
    K. Shuster, S. Humeau, H. Hu, A. Bordes, and J. Weston, “Engaging image captioning via personality,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2019, pp. 12516−12526.
    [216]
    R. Li, H. Liang, Y. Shi, F. Feng, and X. Wang, “Dual-CNN: A convolutional language decoder for paragraph image captioning,” Neurocomputing, vol. 396, pp. 92–101, 2020. doi: 10.1016/j.neucom.2020.02.041
    [217]
    Z. Fei, “Iterative back modification for faster image captioning,” in Proc. 28th ACM Int. Conf. Multimedia, 2020, pp. 3182−3190.
    [218]
    L. Guo, J. Liu, X. Zhu, and H. Lu, “Fast sequence generation with multi-agent reinforcement learning,” arXiv preprint arXiv: 2101.09698, 2021.

    Figures(18)  / Tables(4)


    Highlights

    • A comprehensive review of image captioning is conducted, covering both traditional methods and recent deep learning-based techniques, as well as publicly available datasets, evaluation metrics, and open issues and challenges
    • The focus is on deep learning-based image captioning research, which is categorized into the encoder-decoder framework, attention mechanisms and training strategies according to model structures and training manners. The performance of current state-of-the-art image captioning methods is also compared and analyzed
    • Several future research directions of image captioning are discussed, such as Flexible Captioning, Unpaired Captioning, Paragraph Captioning and Non-Autoregressive Captioning
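    The attention mechanism named in the highlights can be illustrated with a minimal numpy sketch of additive (Bahdanau-style) soft attention over image region features, as popularized by "Show, Attend and Tell". This is a toy illustration with hypothetical dimensions and randomly initialized weights, not the exact formulation of any cited model:

    ```python
    import numpy as np

    def soft_attention(features, hidden, W_f, W_h, v):
        """Additive soft attention over encoder region features.
        features: (k, d) image region features from the encoder
        hidden:   (h,)   current decoder hidden state
        Returns (attention weights over k regions, attended context vector)."""
        scores = np.tanh(features @ W_f + hidden @ W_h) @ v  # (k,) unnormalized scores
        weights = np.exp(scores - scores.max())              # stable softmax
        weights /= weights.sum()                             # weights sum to 1
        context = weights @ features                         # (d,) weighted mix of regions
        return weights, context

    # Toy sizes: k=4 regions, d=5 feature dim, h=3 hidden dim, a=6 attention dim
    rng = np.random.default_rng(0)
    k, d, h, a = 4, 5, 3, 6
    features = rng.normal(size=(k, d))
    hidden = rng.normal(size=(h,))
    W_f, W_h, v = rng.normal(size=(d, a)), rng.normal(size=(h, a)), rng.normal(size=(a,))
    weights, context = soft_attention(features, hidden, W_f, W_h, v)
    ```

    At each decoding step the decoder re-computes these weights, so the generated word can be grounded in different image regions over time.
    
    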
