IEEE/CAA Journal of Automatica Sinica
Citation: P. Liu, Y. J. Zhou, D. Z. Peng, and D. P. Wu, "Global-Attention-Based Neural Networks for Vision Language Intelligence," IEEE/CAA J. Autom. Sinica, vol. 8, no. 7, pp. 1243-1252, Jul. 2021. doi: 10.1109/JAS.2020.1003402
[1] K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. 32nd Int. Conf. Machine Learning, Lille, France, 2015, pp. 2048–2057.
[2] R. Kiros, R. Salakhutdinov, and R. Zemel, “Multimodal neural language models,” in Proc. 31st Int. Conf. Machine Learning, Beijing, China, 2014, pp. 595–603.
[3] Q. Z. You, H. L. Jin, Z. W. Wang, C. Fang, and J. B. Luo, “Image captioning with semantic attention,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 4651–4659.
[4] P. Anderson, X. D. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018, pp. 6077–6086.
[5] J. S. Lu, C. M. Xiong, D. Parikh, and R. Socher, “Knowing when to look: Adaptive attention via a visual sentinel for image captioning,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, USA, 2017, pp. 375–383.
[6] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Boston, USA, 2015, pp. 3156–3164.
[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, USA, 2017, pp. 5998–6008.
[8] L. Huang, W. M. Wang, Y. X. Xia, and J. Chen, “Adaptively aligned image captioning via adaptive attention time,” in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, 2019, pp. 8940–8949.
[9] J. Wu, T. S. Chen, H. F. Wu, Z. Yang, Q. Wang, and L. Lin, “Concrete image captioning by integrating content sensitive and global discriminative objective,” in Proc. IEEE Int. Conf. Multimedia and Expo, Shanghai, China, 2019, pp. 1306–1311.
[10] J. X. Gu, S. Joty, J. F. Cai, H. D. Zhao, X. Yang, and G. Wang, “Unpaired image captioning via scene graph alignments,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Seoul, Korea (South), 2019, pp. 10323–10332.
[11] L. Huang, W. M. Wang, J. Chen, and X. Y. Wei, “Attention on attention for image captioning,” arXiv preprint arXiv:1908.06954, Aug. 2019.
[12] K. M. He, X. Y. Zhang, S. Q. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 770–778.
[13] S. Q. Ren, K. M. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. 28th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2015, pp. 91–99.
[14] K. M. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 2961–2969.
[15] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, May 2016.
[16] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, May 2019.
[17] Y. C. Xu, X. D. Liu, Y. L. Shen, J. J. Liu, and J. F. Gao, “Multi-task learning with sample re-weighting for machine reading comprehension,” arXiv preprint arXiv:1809.06963, Mar. 2019.
[18] M. Soh, “Learning CNN-LSTM architectures for image caption generation,” Stanford Univ., Stanford, USA, 2016.
[19] M. H. Chen, G. G. Ding, S. C. Zhao, H. Chen, J. G. Han, and Q. Liu, “Reference based LSTM for image captioning,” in Proc. 31st AAAI Conf. Artificial Intelligence, San Francisco, USA, 2017.
[20] S. M. Lakew, M. Cettolo, and M. Federico, “A comparison of transformer and recurrent neural networks on multilingual neural machine translation,” arXiv preprint arXiv:1806.06957, Jun. 2018.
[21] C. G. Wang, M. Li, and A. J. Smola, “Language models with transformers,” arXiv preprint arXiv:1904.09408, Oct. 2019.
[22] J. Yu, J. Li, Z. Yu, and Q. M. Huang, “Multimodal transformer with multi-view visual representation for image captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 12, pp. 4467–4480, Dec. 2020. doi: 10.1109/TCSVT.2019.2947482
[23] S. Herdade, A. Kappeler, K. Boakye, and J. Soares, “Image captioning: Transforming objects into words,” arXiv preprint arXiv:1906.05963, Jan. 2020.
[24] M. T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, Sep. 2015.
[25] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, “Meshed-memory transformer for image captioning,” arXiv preprint arXiv:1912.08226, Mar. 2020.
[26] T. Yao, Y. W. Pan, Y. H. Li, and T. Mei, “Exploring visual relationship for image captioning,” in Proc. 15th European Conf. Computer Vision, Munich, Germany, 2018, pp. 684–699.
[27] L. L. Gao, Z. Guo, H. W. Zhang, X. Xu, and H. T. Shen, “Video captioning with attention-based LSTM and semantic consistency,” IEEE Trans. Multimed., vol. 19, no. 9, pp. 2045–2055, Sep. 2017. doi: 10.1109/TMM.2017.2729019
[28] L. L. Gao, X. P. Li, J. K. Song, and H. T. Shen, “Hierarchical LSTMs with adaptive attention for visual captioning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 5, pp. 1112–1131, May 2019.
[29] Z. Gan, C. Gan, X. D. He, Y. C. Pu, K. Tran, J. F. Gao, L. Carin, and L. Deng, “Semantic compositional networks for visual captioning,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, USA, 2017, pp. 5630–5639.
[30] Y. W. Pan, T. Yao, H. Q. Li, and T. Mei, “Video captioning with transferred semantic attributes,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, USA, 2017, pp. 6504–6512.
[31] T. Yao, Y. W. Pan, Y. H. Li, Z. F. Qiu, and T. Mei, “Boosting image captioning with attributes,” in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 4894–4902.
[32] K. Fu, J. Q. Jin, R. P. Cui, F. Sha, and C. S. Zhang, “Aligning where to see and what to tell: Image captioning with region-based attention and scene-specific contexts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2321–2334, Dec. 2017. doi: 10.1109/TPAMI.2016.2642953
[33] F. Liu, T. Xiang, T. M. Hospedales, W. K. Yang, and C. Y. Sun, “Semantic regularisation for recurrent image annotation,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, USA, 2017, pp. 2872–2880.
[34] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, USA, 2017, pp. 7008–7024.
[35] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Boston, USA, 2015, pp. 4566–4575.
[36] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. 13th European Conf. Computer Vision, Zurich, Switzerland, 2014, pp. 740–755.
[37] A. Karpathy and F. F. Li, “Deep visual-semantic alignments for generating image descriptions,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Boston, USA, 2015, pp. 3128–3137.
[38] W. H. Jiang, L. Ma, Y. G. Jiang, W. Liu, and T. Zhang, “Recurrent fusion network for image captioning,” in Proc. 15th European Conf. Computer Vision, Munich, Germany, 2018, pp. 499–515.
[39] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic propositional image caption evaluation,” in Proc. 14th European Conf. Computer Vision, Amsterdam, The Netherlands, 2016, pp. 382–398.
[40] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proc. 40th Annu. Meet. Association for Computational Linguistics, Philadelphia, USA, 2002, pp. 311–318.
[41] M. Denkowski and A. Lavie, “Meteor universal: Language specific translation evaluation for any target language,” in Proc. 9th Workshop on Statistical Machine Translation, Baltimore, USA, 2014, pp. 376–380.
[42] C. Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Proc. Workshop on Text Summarization Branches Out, Barcelona, Spain, 2004, pp. 74–81.
[43] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Proc. 28th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2015, pp. 1171–1179.