IEEE/CAA Journal of Automatica Sinica
Citation: Danyang Liu, Ji Xu, Pengyuan Zhang and Yonghong Yan, "Investigation of Knowledge Transfer Approaches to Improve the Acoustic Modeling of Vietnamese ASR System," IEEE/CAA J. Autom. Sinica, vol. 6, no. 5, pp. 1187-1195, Sept. 2019. doi: 10.1109/JAS.2019.1911693
[1] A. Sankar and C. H. Lee, "Maximum-likelihood approach to stochastic matching for robust speech recognition," IEEE Trans. Speech & Audio Processing, vol. 4, no. 3, pp. 190–202, 1996.
[2] M. L. Seltzer, D. Yu, and Y. Wang, "An investigation of deep neural networks for noise robust speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vancouver, Canada: IEEE, May 2013, pp. 7398–7402.
[3] I. Potamitis, N. Fakotakis, and G. Kokkinakis, "Independent component analysis applied to feature extraction for robust automatic speech recognition," Electronics Letters, vol. 36, no. 23, pp. 1977–1978, 2000.
[4] G. Saon, H. K. J. Kuo, S. Rennie, and M. Picheny, "The IBM 2015 English conversational telephone speech recognition system," in Proc. Interspeech, Dresden, Germany, Sep. 2015.
[5] R. Sahraeian and D. V. Compernolle, "Crosslingual and multilingual speech recognition based on the speech manifold," IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2301–2312, 2017.
[6] L. Lu, A. Ghoshal, and S. Renals, "Maximum a posteriori adaptation of subspace Gaussian mixture models for cross-lingual speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Kyoto, Japan: IEEE, Mar. 2012, pp. 4877–4880.
[7] K. C. Sim and H. Li, "Stream-based context-sensitive phone mapping for cross-lingual speech recognition," in Proc. Interspeech, Brighton, United Kingdom, Sep. 2009, pp. 3019–3022.
[8] J. Köhler, "Multilingual phone models for vocabulary-independent speech recognition tasks," Speech Communication, vol. 35, no. 1–2, pp. 21–30, 2001.
[9] Z. Tang, L. Li, and D. Wang, "Multi-task recurrent model for true multilingual speech recognition," CoRR, vol. abs/1609.08337, 2016. [Online]. Available: http://arxiv.org/abs/1609.08337
[10] A. Mohan and R. Rose, "Multi-lingual speech recognition with low-rank multi-task deep neural networks," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, South Brisbane, Australia: IEEE, Apr. 2015, pp. 4994–4998.
[11] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," CoRR, vol. abs/1609.06773, 2016. [Online]. Available: http://arxiv.org/abs/1609.06773
[12] Z. Tang, L. Li, D. Wang, and R. C. Vipperla, "Collaborative joint training with multi-task recurrent model for speech and speaker recognition," IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 25, no. 3, pp. 493–504, Mar. 2017.
[13] T. Robinson, M. Hochberg, and S. Renals, "IPA: Improved phone modelling with recurrent neural networks," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Adelaide, Australia: IEEE, Apr. 1994, pp. I/37–I/40.
[14] T. Schultz, N. T. Vu, and T. Schlippe, "GlobalPhone: A multilingual text & speech database in 20 languages," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vancouver, Canada: IEEE, May 2013, pp. 8126–8130.
[15] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Proc. Int. Conf. Neural Information Processing Systems, Montreal, Canada, Dec. 2014, pp. 3320–3328.
[16] D. Yu, L. Deng, and G. E. Dahl, "Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition," in Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
[17] K. Yanai and Y. Kawano, "Food image recognition using deep convolutional network with pre-training and fine-tuning," in Proc. IEEE Int. Conf. Multimedia & Expo Workshops, Torino, Italy: IEEE, Jun. 2015, pp. 1–6.
[18] A. Das and M. Hasegawa-Johnson, "Cross-lingual transfer learning during supervised training in low resource scenarios," in Proc. Interspeech, Dresden, Germany, Sep. 2015, pp. 3531–3535.
[19] E. Gasca, J. S. Sánchez, and R. Alonso, "Eliminating redundancy and irrelevance using a new MLP-based feature selection method," Pattern Recognition, vol. 39, no. 2, pp. 313–315, 2006.
[20] A. Asaei, B. Picart, and H. Bourlard, "Analysis of phone posterior feature space exploiting class-specific sparsity and MLP-based similarity measure," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Dallas, Texas, USA: IEEE, Mar. 2010, pp. 4886–4889.
[21] V. Fontaine, C. Ris, and J. M. Boite, "Nonlinear discriminant analysis for improved speech recognition," in Proc. European Conf. Speech Communication and Technology (Eurospeech), Rhodes, Greece, Sep. 1997.
[22] F. Grézl, M. Karafiát, S. Kontár, and J. Černocký, "Probabilistic and bottle-neck features for LVCSR of meetings," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA: IEEE, Apr. 2007, pp. IV-757–IV-760.
[23] F. Grézl, M. Karafiát, and L. Burget, "Investigation into bottle-neck features for meeting speech recognition," in Proc. Interspeech, Brighton, United Kingdom, Sep. 2009, pp. 2947–2950.
[24] K. Veselý, M. Karafiát, F. Grézl, M. Janda, and E. Egorova, "The language-independent bottleneck features," in Proc. IEEE Spoken Language Technology Workshop, Miami, Florida, USA: IEEE, Dec. 2012, pp. 336–341.
[25] E. Chuangsuwanich, Y. Zhang, and J. Glass, "Multilingual data selection for training stacked bottleneck features," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Shanghai, China: IEEE, Mar. 2016, pp. 5410–5414.
[26] M. Tuerxun, S. Zhang, Y. Bao, and L. Dai, "Improvements on bottleneck feature for large vocabulary continuous speech recognition," in Proc. Int. Conf. Signal Processing, Auckland, New Zealand, Dec. 2015, pp. 516–520.
[27] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, "Progressive neural networks," CoRR, vol. abs/1606.04671, 2016. [Online]. Available: http://arxiv.org/abs/1606.04671
[28] J. Gideon, S. Khorram, Z. Aldeneh, D. Dimitriadis, and E. M. Provost, "Progressive neural networks for transfer learning in emotion recognition," CoRR, vol. abs/1706.03256, 2017. [Online]. Available: http://arxiv.org/abs/1706.03256
[29] D. Povey, X. Zhang, and S. Khudanpur, "Parallel training of DNNs with natural gradient and parameter averaging," CoRR, vol. abs/1410.7455, 2014. [Online]. Available: http://arxiv.org/abs/1410.7455
[30] H. Yang and S. Amari, "Complexity issues in natural gradient descent method for training multilayer perceptrons," Neural Computation, vol. 10, no. 8, pp. 2137–2157, 1998.
[31] M. Rattray, D. Saad, and S. I. Amari, "Natural gradient descent for on-line learning," Physical Review Letters, vol. 81, no. 24, pp. 5461–5464, 1998.
[32] K. Veselý, M. Karafiát, F. Grézl, M. Janda, and E. Egorova, "The language-independent bottleneck features," in Proc. IEEE Spoken Language Technology Workshop, Miami, Florida, USA: IEEE, Dec. 2012, pp. 336–341.
[33] T. N. Sainath, B. Kingsbury, and B. Ramabhadran, "Auto-encoder bottleneck features using deep belief networks," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Kyoto, Japan: IEEE, Mar. 2012, pp. 4153–4156.
[34] D. Yu and M. L. Seltzer, "Improved bottleneck features using pretrained deep neural networks," in Proc. Interspeech, Florence, Italy, Aug. 2011, pp. 237–240.
[35] J. Gehring, Y. Miao, F. Metze, and A. Waibel, "Extracting deep bottleneck features using stacked auto-encoders," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vancouver, Canada: IEEE, May 2013, pp. 3377–3381.
[36] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," J. Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[37] S. Imai, "Cepstral analysis synthesis on the mel frequency scale," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Massachusetts, USA: IEEE, Apr. 1983, pp. 93–96.
[38] B. Chen, W. H. Chen, S. H. Lin, and W. Y. Chu, "Robust speech recognition using spatial-temporal feature distribution characteristics," Pattern Recognition Letters, vol. 32, no. 7, pp. 919–926, 2011.
[39] Q. B. Nguyen, J. Gehring, M. Müller, S. Stüker, and A. Waibel, "Multilingual shifting deep bottleneck features for low-resource ASR," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Florence, Italy: IEEE, May 2014, pp. 5607–5611.
[40] S. Wiesler, J. Li, and J. Xue, "Investigations on Hessian-free optimization for cross-entropy training of deep neural networks," in Proc. Interspeech, Lyon, France, Aug. 2013, pp. 3317–3321.
[41] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, and P. Schwarz, "The Kaldi speech recognition toolkit," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, Waikoloa, Hawaii, USA: IEEE Signal Processing Society, Dec. 2011.
[42] K. Zechner and A. Waibel, "Minimizing word error rate in textual summaries of spoken language," in Proc. 1st Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), Seattle, Washington, USA, Apr. 2000.
[43] P. Pujol, S. Pol, C. Nadeu, A. Hagen, and H. Bourlard, "Comparison and combination of features in a hybrid HMM/MLP and a HMM/GMM speech recognition system," IEEE Trans. Speech & Audio Processing, vol. 13, no. 1, pp. 14–22, 2005.
[44] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[45] W. Hartmann, R. Hsiao, and S. Tsakalidis, "Alternative networks for monolingual bottleneck features," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, New Orleans, LA, USA: IEEE, Mar. 2017, pp. 5290–5294.