IEEE/CAA Journal of Automatica Sinica
Citation: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning Based Acoustic Models," IEEE/CAA J. Autom. Sinica, vol. 4, no. 3, pp. 396-409, July 2017. doi: 10.1109/JAS.2017.7510508
[1] D. Yu, L. Deng, and G. E. Dahl, "Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition," in Proc. NIPS 2010 Workshop on Deep Learning and Unsupervised Feature Learning, 2010. https://www.researchgate.net/publication/228631482_Roles_of_Pre-Training_and_Fine-Tuning_in_Context-Dependent_DBN-HMMs_for_Real-World_Speech_Recognition
[2] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 30-42, Jan. 2012.
[3] D. Yu, F. Seide, and G. Li, "Conversational speech transcription using context-dependent deep neural networks," in Proc. Interspeech, Florence, Italy, 2011, pp. 437-440. http://www.isca-speech.org/archive/archive_papers/interspeech_2011/i11_0437.pdf
[4] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Mag., vol. 29, no. 6, pp. 82-97, Nov. 2012.
[5] O. Abdel-Hamid, A. R. Mohamed, H. Jiang, and G. Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," in Proc. 2012 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Kyoto, Japan, 2012, pp. 4277-4280. https://www.researchgate.net/publication/261119155_Applying_Convolutional_Neural_Networks_concepts_to_hybrid_NN-HMM_model_for_speech_recognition
[6] L. Deng, J. Li, J. T. Huang, K. S. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. D. He, J. Williams, Y. F. Gong, and A. Acero, "Recent advances in deep learning for speech research at Microsoft," in Proc. 2013 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 8604-8608.
[7] O. Abdel-Hamid, A. R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Trans. Audio Speech Lang. Processing, vol. 22, no. 10, pp. 1533-1545, Oct. 2014.
[8] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Proc. Interspeech, Singapore, 2014, pp. 338-342. https://www.researchgate.net/publication/279714069_Long_short-term_memory_recurrent_neural_network_architectures_for_large_scale_acoustic_modeling
[9] H. Sak, A. Senior, K. Rao, O. İrsoy, A. Graves, F. Beaufays, and J. Schalkwyk, "Learning acoustic frame labeling for speech recognition with recurrent neural networks," in Proc. 2015 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Brisbane, QLD, Australia, 2015, pp. 4280-4284. https://www.researchgate.net/publication/304525733_Learning_acoustic_frame_labeling_for_speech_recognition_with_recurrent_neural_networks
[10] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in Proc. 2015 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Brisbane, QLD, Australia, 2015, pp. 4580-4584. https://www.researchgate.net/publication/308872979_Convolutional_Long_Short-Term_Memory_fully_connected_Deep_Neural_Networks
[11] M. X. Bi, Y. M. Qian, and K. Yu, "Very deep convolutional neural networks for LVCSR," in Proc. Interspeech, Dresden, Germany, 2015, pp. 3259-3263. http://www.isca-speech.org/archive/interspeech_2015/i15_3259.html
[12] V. Mitra and H. Franco, "Time-frequency convolutional networks for robust speech recognition," in Proc. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, Scottsdale, AZ, USA, 2015, pp. 317-323. https://www.sri.com/sites/default/files/publications/time-frequency_convolutional_networks_for_robust_speech_recognition.pdf
[13] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Proc. Interspeech, Dresden, Germany, 2015, pp. 3214-3218. http://www.isca-speech.org/archive/interspeech_2015/papers/i15_3214.pdf
[14] T. Sercu, C. Puhrsch, B. Kingsbury, and Y. LeCun, "Very deep multilingual convolutional neural networks for LVCSR," in Proc. 2016 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Shanghai, China, 2016, pp. 4955-4959. http://cims.nyu.edu/~ts2387/talks/sercu_icassp16_verydeepCNN.pdf
[15] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. D. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. Engel, L. X. Fan, C. Fougner, T. Han, A. Hannun, B. Jun, P. LeGresley, L. Lin, S. Narang, A. Ng, S. Ozair, R. Prenger, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, Y. Wang, Z. Q. Wang, C. Wang, B. Xiao, D. N. Yogatama, J. Zhan, and Z. Y. Zhu, "Deep Speech 2: End-to-end speech recognition in English and Mandarin," arXiv:1512.02595, 2015. https://www.researchgate.net/publication/286513561_Deep_Speech_2_End-to-End_Speech_Recognition_in_English_and_Mandarin
[16] S. L. Zhang, C. Liu, H. Jiang, S. Wei, L. R. Dai, and Y. Hu, "Feedforward sequential memory networks: A new structure to learn long-term dependency," arXiv:1512.08301, 2015. http://arxiv.org/pdf/1512.08301
[17] D. Yu, W. Xiong, J. Droppo, A. Stolcke, G. L. Ye, J. Li, and G. Zweig, "Deep convolutional neural networks with layer-wise context expansion and attention," in Proc. Interspeech, San Francisco, USA, 2016. http://www.isca-speech.org/archive/Interspeech_2016/pdfs/0251.PDF
[18] H. Soltau, H. Liao, and H. Sak, "Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition," arXiv:1610.09975, 2016. https://www.researchgate.net/publication/309572693_Neural_Speech_Recognizer_Acoustic-to-Word_LSTM_Model_for_Large_Vocabulary_Speech_Recognition
[19] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "Achieving human parity in conversational speech recognition," arXiv:1610.05256, 2016. https://www.researchgate.net/publication/309207213_Achieving_Human_Parity_in_Conversational_Speech_Recognition
[20] D. Yu, M. Kolbaek, Z. H. Tan, and J. Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," arXiv:1607.00325, 2017. https://www.researchgate.net/publication/304758178_Permutation_Invariant_Training_of_Deep_Models_for_Speaker-Independent_Multi-talker_Speech_Separation
[21] M. Kolbaek, D. Yu, Z. H. Tan, and J. Jensen, "Multi-talker speech separation and tracing with permutation invariant training of deep recurrent neural networks," arXiv:1703.06284, 2017. https://www.researchgate.net/publication/315455037_Multi-talker_Speech_Separation_and_Tracing_with_Permutation_Invariant_Training_of_Deep_Recurrent_Neural_Networks
[22] D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning Approach. London: Springer, 2015. http://www.worldcat.org/title/automatic-speech-recognition-a-deep-learning-approach/oclc/895161787
[23] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735-1780, Nov. 1997.
[24] A. Graves, A. R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. 2013 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 6645-6649. http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf
[25] X. G. Li and X. H. Wu, "Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition," in Proc. 2015 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Brisbane, QLD, Australia, 2015, pp. 4520-4524. https://www.researchgate.net/publication/267046395_Constructing_Long_Short-Term_Memory_based_Deep_Recurrent_Neural_Networks_for_Large_Vocabulary_Speech_Recognition
[26] Y. J. Miao and F. Metze, "On speaker adaptation of long short-term memory recurrent neural networks," in Proc. Interspeech, Dresden, Germany, 2015, pp. 1101-1105. https://www.cs.cmu.edu/~ymiao/pub/is2015_lstm.pdf
[27] Y. J. Miao, J. Li, Y. Q. Wang, S. X. Zhang, and Y. F. Gong, "Simplifying long short-term memory acoustic models for fast training and decoding," in Proc. 2016 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Shanghai, China, 2016. https://www.researchgate.net/publication/304372410_Simplifying_long_short-term_memory_acoustic_models_for_fast_training_and_decoding
[28] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv:1412.3555, 2014. https://www.researchgate.net/publication/269416998_Empirical_Evaluation_of_Gated_Recurrent_Neural_Networks_on_Sequence_Modeling
[29] Y. Zhang, G. G. Chen, D. Yu, K. S. Yao, S. Khudanpur, and J. Glass, "Highway long short-term memory RNNs for distant speech recognition," in Proc. 2016 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Shanghai, China, 2016. https://www.researchgate.net/publication/304372363_Highway_long_short-term_memory_RNNS_for_distant_speech_recognition
[30] Y. Y. Zhao, S. Xu, and B. Xu, "Multidimensional residual learning based on recurrent neural networks for acoustic modeling," in Proc. Interspeech, San Francisco, USA, 2016, pp. 3419-3423. https://www.researchgate.net/publication/307889265_Multidimensional_Residual_Learning_Based_on_Recurrent_Neural_Networks_for_Acoustic_Modeling
[31] J. Kim, M. El-Khamy, and J. Lee, "Residual LSTM: Design of a deep recurrent architecture for distant speech recognition," arXiv:1701.03360, 2017. https://www.researchgate.net/publication/312283320_Residual_LSTM_Design_of_a_Deep_Recurrent_Architecture_for_Distant_Speech_Recognition
[32] K. He, X. Y. Zhang, S. Q. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv:1512.03385, 2015. https://www.researchgate.net/publication/286512696_Deep_Residual_Learning_for_Image_Recognition
[33] A. R. Mohamed, G. Hinton, and G. Penn, "Understanding how deep belief networks perform acoustic modelling," in Proc. 2012 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Kyoto, Japan, 2012, pp. 4273-4276. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.224.2314&rep=rep1&type=pdf
[34] J. Li, D. Yu, J. T. Huang, and Y. F. Gong, "Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM," in Proc. 2012 IEEE Spoken Language Technology Workshop, Miami, FL, USA, 2012, pp. 131-136. https://www.researchgate.net/publication/261421822_Improving_wideband_speech_recognition_using_mixed-bandwidth_training_data_in_CD-DNN-HMM
[35] J. Li, A. Mohamed, G. Zweig, and Y. F. Gong, "LSTM time and frequency recurrence for automatic speech recognition," in Proc. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, Scottsdale, AZ, USA, 2015. https://www.researchgate.net/publication/300412435_LSTM_time_and_frequency_recurrence_for_automatic_speech_recognition
[36] J. Li, A. Mohamed, G. Zweig, and Y. F. Gong, "Exploring multidimensional LSTMs for large vocabulary ASR," in Proc. 2016 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Shanghai, China, 2016. https://www.researchgate.net/publication/304372654_Exploring_multidimensional_lstms_for_large_vocabulary_ASR
[37] T. N. Sainath and B. Li, "Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks," in Proc. Interspeech, San Francisco, USA, 2016. http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45401.pdf
[38] N. Kalchbrenner, I. Danihelka, and A. Graves, "Grid long short-term memory," arXiv:1507.01526, 2015. https://www.researchgate.net/publication/279864537_Grid_Long_Short-Term_Memory
[39] W. N. Hsu, Y. Zhang, and J. Glass, "A prioritized grid long short-term memory RNN for speech recognition," in Proc. 2016 IEEE Spoken Language Technology Workshop (SLT), San Diego, CA, USA, 2016, pp. 467-473. http://people.csail.mit.edu/jrg/2016/Wei-Ning-SLT-16.pdf
[40] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Netw., vol. 18, no. 5-6, pp. 602-610, Jul.-Aug. 2005.
[41] S. F. Xue and Z. J. Yan, "Improving latency-controlled BLSTM acoustic models for online speech recognition," in Proc. 2017 IEEE Int. Conf. Acoustics, Speech and Signal Processing, New Orleans, USA, 2017.
[42] Y. LeCun and Y. Bengio, "Convolutional networks for images, speech, and time-series," in The Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. Cambridge: MIT Press, 1995.
[43] K. J. Lang, A. H. Waibel, and G. E. Hinton, "A time-delay neural network architecture for isolated word recognition," Neural Netw., vol. 3, no. 1, pp. 23-43, Dec. 1990.
[44] O. Abdel-Hamid, L. Deng, and D. Yu, "Exploring convolutional neural network structures and optimization techniques for speech recognition," in Proc. Interspeech, Lyon, France, 2013, pp. 3366-3370. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.703.648
[45] T. N. Sainath, A. R. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in Proc. 2013 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 8614-8618. https://www.researchgate.net/publication/261153442_Deep_convolutional_neural_networks_for_LVCSR
[46] T. Sercu and V. Goel, "Dense prediction on sequences with time-dilated convolutions for speech recognition," arXiv:1611.09288, 2016. https://www.researchgate.net/publication/311066943_Dense_Prediction_on_Sequences_with_Time-Dilated_Convolutions_for_Speech_Recognition
[47] L. Tóth, "Modeling long temporal contexts in convolutional neural network-based phone recognition," in Proc. 2015 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Brisbane, QLD, Australia, 2015, pp. 4575-4579. https://www.researchgate.net/publication/308862096_Modeling_long_temporal_contexts_in_convolutional_neural_network-based_phone_recognition
[48] T. Zhao, Y. X. Zhao, and X. Chen, "Time-frequency kernel-based CNN for speech recognition," in Proc. Interspeech, Dresden, Germany, 2015. https://www.researchgate.net/publication/293653051_Time-frequency_kernel_based_CNN_for_speech_recognition
[49] N. Jaitly and G. Hinton, "Learning a better representation of speech soundwaves using restricted Boltzmann machines," in Proc. 2011 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Prague, Czech Republic, 2011, pp. 5884-5887. http://www.academia.edu/11686402/Learning_a_better_representation_of_speech_soundwaves_using_restricted_boltzmann_machines
[50] D. Palaz, R. Collobert, and M. Magimai-Doss, "Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks," in Proc. Interspeech, Lyon, France, 2013. https://www.researchgate.net/publication/237145551_Estimating_Phoneme_Class_Conditional_Probabilities_from_Raw_Speech_Signal_using_Convolutional_Neural_Networks
[51] Z. Tüske, P. Golik, R. Schlüter, and H. Ney, "Acoustic modeling with deep neural networks using raw time signal for LVCSR," in Proc. Interspeech, Singapore, 2014, pp. 890-894. http://www.academia.edu/18376509/Acoustic_Modeling_with_Deep_Neural_Networks_Using_Raw_Time_Signal_for_LVCSR
[52] T. N. Sainath, R. J. Weiss, A. W. Senior, K. W. Wilson, and O. Vinyals, "Learning the speech front-end with raw waveform CLDNNs," in Proc. Interspeech, Dresden, Germany, 2015, pp. 1-5. http://www.ee.columbia.edu/~ronw/pubs/interspeech2015-waveform_cldnn.pdf
[53] H. Dinkel, N. X. Chen, Y. M. Qian, and K. Yu, "End-to-end spoofing detection with raw waveform CLDNNs," in Proc. 2017 IEEE Int. Conf. Acoustics, Speech and Signal Processing, New Orleans, USA, 2017. https://arxiv.org/pdf/1610.00564v1
[54] T. Yoshioka, N. Ito, M. Delcroix, A. Ogawa, K. Kinoshita, M. Fujimoto, C. Z. Yu, W. J. Fabian, M. Espi, T. Higuchi, S. Araki, and T. Nakatani, "The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices," in Proc. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA, 2015, pp. 436-443.
[55] X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. L. Seltzer, G. G. Chen, Y. Zhang, M. Mandel, and D. Yu, "Deep beamforming networks for multi-channel speech recognition," in Proc. 2016 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Shanghai, China, 2016, pp. 5745-5749.
[56] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, M. Bacchiani et al., "Speaker location and microphone spacing invariant acoustic modeling from raw multichannel waveforms," in Proc. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA, 2015, pp. 30-36. http://www.ee.columbia.edu/~ronw/pubs/asru2015-multichannel_cldnn.pdf
[57] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, and M. Bacchiani, "Factored spatial and spectral multichannel raw waveform CLDNNs," in Proc. 2016 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Shanghai, China, 2016, pp. 5075-5079. http://www.ee.columbia.edu/~ronw/pubs/icassp2016-factored_cldnn.pdf
[58] T. N. Sainath, R. J. Weiss, K. W. Wilson, B. Li, A. Narayanan, E. Variani, M. Bacchiani, I. Shafran, A. Senior, K. W. Chin, A. Misra, and C. Kim, "Multichannel signal processing with deep neural networks for automatic speech recognition," IEEE/ACM Trans. Audio Speech Lang. Processing, vol. 25, no. 5, pp. 965-979, May 2017.
[59] E. Variani, T. N. Sainath, I. Shafran, and M. Bacchiani, "Complex linear projection (CLP): A discriminative approach to joint feature extraction and acoustic modeling," in Proc. Interspeech, San Francisco, USA, 2016, pp. 808-812.
[60] H. Sak, A. Senior, K. Rao, and F. Beaufays, "Fast and accurate recurrent neural network acoustic models for speech recognition," in Proc. Interspeech, Dresden, Germany, 2015. http://www.isca-speech.org/archive/interspeech_2015/papers/i15_1468.pdf
[61] A. Senior, H. Sak, F. de Chaumont Quitry, T. Sainath, and K. Rao, "Acoustic modelling with CD-CTC-SMBR LSTM RNNs," in Proc. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA, 2015, pp. 604-609. https://www.researchgate.net/publication/304407558_Acoustic_modelling_with_CD-CTC-SMBR_LSTM_RNNS
[62] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. 23rd Int. Conf. Machine Learning, Pittsburgh, PA, USA, 2006, pp. 369-376.
[63] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, "Deep Speech: Scaling up end-to-end speech recognition," arXiv:1412.5567, 2014. https://www.researchgate.net/publication/269722411_DeepSpeech_Scaling_up_end-to-end_speech_recognition
[64] Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in Proc. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA, 2015, pp. 167-174. https://www.researchgate.net/publication/280589906_EESEN_End-to-end_speech_recognition_using_deep_RNN_models_and_WFST-based_decoding
[65] Y. J. Miao, M. Gowayyed, X. Y. Na, T. Ko, F. Metze, and A. Waibel, "An empirical exploration of CTC acoustic models," in Proc. 2016 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Shanghai, China, 2016, pp. 2623-2627. https://www.cs.cmu.edu/~ymiao/pub/icassp2016_ctc.pdf
[66] K. Rao and H. Sak, "Multi-accent speech recognition with hierarchical grapheme based models," in Proc. 2017 IEEE Int. Conf. Acoustics, Speech and Signal Processing, New Orleans, USA, 2017.
[67] G. Zweig, C. Z. Yu, J. Droppo, and A. Stolcke, "Advances in all-neural speech recognition," in Proc. 2017 IEEE Int. Conf. Acoustics, Speech and Signal Processing, New Orleans, USA, 2017.
[68] H. R. Liu, Z. Y. Zhu, X. G. Li, and S. Satheesh, "Gram-CTC: Automatic unit selection and target decomposition for sequence labelling," arXiv:1703.00096, 2017. https://www.researchgate.net/publication/314153409_Gram-CTC_Automatic_Unit_Selection_and_Target_Decomposition_for_Sequence_Labelling
[69] Z. H. Chen, Y. M. Zhuang, Y. M. Qian, and K. Yu, "Phone synchronous speech recognition with CTC lattices," IEEE/ACM Trans. Audio Speech Lang. Processing, vol. 25, no. 1, pp. 90-101, Jan. 2017. doi: 10.1109/TASLP.2016.2625459
[70] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Y. Na, Y. M. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Proc. Interspeech, San Francisco, USA, 2016. http://www.isca-speech.org/archive/Interspeech_2016/pdfs/0595.PDF
[71] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in Proc. 2016 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Shanghai, China, 2016, pp. 4945-4949. http://mirlab.org/conference_papers/International_Conference/ICASSP%202016/pdfs/0004945.pdf
[72] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. 2016 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Shanghai, China, 2016, pp. 4960-4964.
[73] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv:1409.0473, 2014.
[74] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, "Recurrent models of visual attention," in Advances in Neural Information Processing Systems 27, Montreal, Canada, 2014, pp. 2204-2212.
[75] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv:1406.1078, 2014. http://www.academia.edu/11462276/Learning_Phrase_Representations_using_RNN_Encoder-Decoder_for_Statistical_Machine_Translation
[76] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in Proc. 2017 IEEE Int. Conf. Acoustics, Speech and Signal Processing, New Orleans, USA, 2017. https://www.merl.com/publications/docs/TR2017-016.pdf
[77] Y. Zhang, W. Chan, and N. Jaitly, "Very deep convolutional networks for end-to-end speech recognition," in Proc. 2017 IEEE Int. Conf. Acoustics, Speech and Signal Processing, New Orleans, USA, 2017. https://www.researchgate.net/publication/308981069_Very_Deep_Convolutional_Networks_for_End-to-End_Speech_Recognition
[78] J. Li, L. Deng, Y. F. Gong, and R. Haeb-Umbach, "An overview of noise-robust automatic speech recognition," IEEE/ACM Trans. Audio Speech Lang. Processing, vol. 22, no. 4, pp. 745-777, Apr. 2014.
[79] J. Li, L. Deng, R. Haeb-Umbach, and Y. F. Gong, Robust Automatic Speech Recognition: A Bridge to Practical Applications. Waltham: Academic Press, 2015.
[80] F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 2011, pp. 24-29. https://www.researchgate.net/publication/239765773_Feature_engineering_in_Context-Dependent_Deep_Neural_Networks_for_conversational_speech_transcription
[81] H. Liao, "Speaker adaptation of context dependent deep neural networks," in Proc. 2013 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 7947-7951.
[82] D. Yu, K. S. Yao, H. Su, G. Li, and F. Seide, "KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition," in Proc. 2013 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 7893-7897. https://www.researchgate.net/publication/261194718_KL-divergence_regularized_deep_neural_network_adaptation_for_improved_large_vocabulary_speech_recognition
[83] Z. Huang, J. Li, S. M. Siniscalchi, I. F. Chen, J. Wu, and C. H. Lee, "Rapid adaptation for deep neural networks through multi-task learning," in Proc. Interspeech, Dresden, Germany, 2015, pp. 3625-3629.
[84] J. Xue, J. Li, D. Yu, M. Seltzer, and Y. F. Gong, "Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network," in Proc. 2014 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Florence, Italy, 2014, pp. 6359-6363. https://www.researchgate.net/publication/269295270_Singular_value_decomposition_based_low-footprint_speaker_adaptation_and_personalization_for_deep_neural_network
[85] J. Xue, J. Li, and Y. F. Gong, "Restructuring of deep neural network acoustic models with singular value decomposition," in Proc. Interspeech, Lyon, France, 2013, pp. 2365-2369. http://www.microsoft.com/en-us/research/wp-content/uploads/2013/01/svd_v2.pdf
[86] P. Swietojanski and S. Renals, "Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models," in Proc. 2014 IEEE Spoken Language Technology Workshop, South Lake Tahoe, NV, USA, 2014. http://www.cstr.ed.ac.uk/downloads/publications/2014/ps-slt14.pdf
[87] P. Swietojanski, J. Li, and S. Renals, "Learning hidden unit contributions for unsupervised acoustic model adaptation," IEEE/ACM Trans. Audio Speech Lang. Processing, vol. 24, no. 8, pp. 1450-1463, Aug. 2016.
[88] Y. Zhao, J. Li, J. Xue, and Y. F. Gong, "Investigating online low-footprint speaker adaptation using generalized linear regression and click-through data," in Proc. 2015 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Brisbane, QLD, Australia, 2015, pp. 4310-4314. https://www.researchgate.net/publication/285612590_Investigating_online_low-footprint_speaker_adaptation_using_generalized_linear_regression_and_click-through_data
[89] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," in Proc. 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, 2013, pp. 55-59. https://www.researchgate.net/profile/George_Saon/publication/261485126_Speaker_adaptation_of_neural_network_acoustic_models_using_i-vectors/links/558d70f108ae15962d8939c7.pdf
[90] A. Senior and I. Lopez-Moreno, "Improving DNN speaker independence with i-vector inputs," in Proc. 2014 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Florence, Italy, 2014, pp. 225-229. https://www.researchgate.net/publication/269294930_Improving_DNN_speaker_independence_with_I-vector_inputs
[91] O. Abdel-Hamid and H. Jiang, "Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code," in Proc. 2013 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 7942-7946. https://www.researchgate.net/publication/261500509_Fast_speaker_adaptation_of_hybrid_NNHMM_model_for_speech_recognition_based_on_discriminative_learning_of_speaker_code
[92] M. L. Seltzer, D. Yu, and Y. Q. Wang, "An investigation of deep neural networks for noise robust speech recognition," in Proc. 2013 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 7398-7402. https://www.researchgate.net/publication/261125912_An_investigation_of_deep_neural_networks_for_noise_robust_speech_recognition
[93] D. Yu and L. Deng, "Adaptation of deep neural networks," in Automatic Speech Recognition, D. Yu and L. Deng, Eds. London: Springer, 2015, pp. 193-215.
[94] Y. J. Miao, H. Zhang, and F. Metze, "Towards speaker adaptive training of deep neural network acoustic models," in Proc. Interspeech, Singapore, 2014. http://repository.cmu.edu/cgi/viewcontent.cgi?article=1068&context=lti
[95] J. Li, J. T. Huang, and Y. F. Gong, "Factorized adaptation for deep neural network," in Proc. 2014 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Florence, Italy, 2014. https://www.researchgate.net/publication/271468476_Factorized_adaptation_for_deep_neural_network
[96] T. Tan, Y. M. Qian, M. F. Yin, Y. M. Zhuang, and K. Yu, "Cluster adaptive training for deep neural network," in Proc. 2015 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Brisbane, QLD, Australia, 2015, pp. 4325-4329.
[97] C. Y. Wu and M. J. F. Gales, "Multi-basis adaptive neural network for rapid adaptation in speech recognition," in Proc. 2015 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Brisbane, QLD, Australia, 2015, pp. 4315-4319. https://www.researchgate.net/publication/285612709_Multi-basis_adaptive_neural_network_for_rapid_adaptation_in_speech_recognition
[98] L. Samarakoon and K. C. Sim, "Factorized hidden layer adaptation for deep neural network based acoustic modeling," IEEE/ACM Trans. Audio Speech Lang. Processing, vol. 24, no. 12, pp. 2241-2250, Dec. 2016. doi: 10.1109/TASLP.2016.2601146
[99] L. Samarakoon, K. C. Sim, and B. Mak, "An investigation into learning effective speaker subspaces for robust unsupervised DNN adaptation," in Proc. 2017 IEEE Int. Conf. Acoustics, Speech and Signal Processing, New Orleans, USA, 2017.
[100] R. Kuhn, P. Nguyen, J. C. Junqua, L. Goldwasser, N. Niedzielski, S. Fincke, K. L. Field, and M. Contolini, "Eigenvoices for speaker adaptation," in Proc. 5th Int. Conf. Spoken Language Processing (ICSLP), Sydney, Australia, 1998, pp. 1774-1777. https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester1_2007_8/kuhn-junqua-eigenvoice-icslp1998.pdf
[101] M. J. F. Gales, "Cluster adaptive training for speech recognition," in Proc. 5th Int. Conf. Spoken Language Processing (ICSLP), Sydney, Australia, 1998, Article ID 0375.
[102] M. Delcroix, K. Kinoshita, T. Hori, and T. Nakatani, "Context adaptive deep neural networks for fast acoustic model adaptation," in Proc. 2015 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Brisbane, QLD, Australia, 2015, pp. 4535-4539. http://ieeexplore.ieee.org/document/7178829/
[103] M. Delcroix, K. Kinoshita, C. Z. Yu, A. Ogawa, T. Yoshioka, and T. Nakatani, "Context adaptive deep neural networks for fast acoustic model adaptation in noisy conditions," in Proc. 2016 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Shanghai, China, 2016, pp. 5270-5274. https://www.researchgate.net/publication/304372396_Context_adaptive_deep_neural_networks_for_fast_acoustic_model_adaptation_in_noisy_conditions
[104] Y. Zhao, J. Li, K. Kumar, and Y. Gong, "Extended low-rank plus diagonal adaptation for deep and recurrent neural networks," in Proc. 2017 IEEE Int. Conf. Acoustics, Speech and Signal Processing, New Orleans, USA, 2017.
[105] M. Cooke, J. R. Hershey, and S. J. Rennie, "Monaural speech separation and recognition challenge," Comput. Speech Lang., vol. 24, no. 1, pp. 1-15, Jan. 2010.
[106] C. Weng, D. Yu, M. L. Seltzer, and J. Droppo, "Deep neural networks for single-channel multi-talker speech recognition," IEEE/ACM Trans. Audio Speech Lang. Processing, vol. 23, no. 10, pp. 1670-1679, Oct. 2015.
[107] Y. X. Wang, A. Narayanan, and D. L. Wang, "On training targets for supervised speech separation," IEEE/ACM Trans. Audio Speech Lang. Processing, vol. 22, no. 12, pp. 1849-1858, Dec. 2014.
[108] Y. Xu, J. Du, L. R. Dai, and C. H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Lett., vol. 21, no. 1, pp. 65-68, Jan. 2014.
[109] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in Latent Variable Analysis and Signal Separation (LVA/ICA 2015), Lecture Notes in Computer Science, E. Vincent, A. Yeredor, Z. Koldovský, and P. Tichavský, Eds. Cham: Springer, 2015, pp. 91-99. https://www.researchgate.net/publication/278747057_Speech_enhancement_with_LSTM_recurrent_neural_networks_and_its_application_to_noise-robust_ASR
[110] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Joint optimization of masks and deep recurrent neural networks for monaural source separation," IEEE/ACM Trans. Audio Speech Lang. Processing, vol. 23, no. 12, pp. 2136-2147, Dec. 2015.
[111] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in Proc. 2016 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Shanghai, China, 2016, pp. 31-35. http://labrosa.ee.columbia.edu/cuneuralnet/chen111815.pdf
[112] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-channel multi-speaker separation using deep clustering," in Proc. Interspeech, San Francisco, USA, 2016, pp. 545-549. https://www.merl.com/publications/docs/TR2016-073.pdf
[113] M. Cooke, Modelling Auditory Processing and Organisation. Cambridge: Cambridge Univ. Press, 2005.
[114] D. P. Ellis, "Prediction-driven computational auditory scene analysis," Ph.D. dissertation, Massachusetts Inst. Technol., Cambridge, MA, 1996. http://academiccommons.columbia.edu/download/fedora_content/download/ac:144539/CONTENT/dpwe-phd-thesis.pdf
[115] M. Wertheimer, "Laws of organization in perceptual forms," in A Source Book of Gestalt Psychology, W. D. Ellis, Ed. London: Kegan Paul, Trench, Trubner & Company, 1938.
[116] M. N. Schmidt and R. K. Olsson, "Single-channel speech separation using sparse non-negative matrix factorization," in Proc. 9th Int. Conf. Spoken Language Processing (ICSLP), Pittsburgh, PA, USA, 2006. https://www.researchgate.net/publication/221491907_Single-channel_speech_separation_using_sparse_non-negative_matrix_factorization
[117] P. Smaragdis, "Convolutive speech bases and their application to supervised speech separation," IEEE Trans. Audio Speech Lang. Processing, vol. 15, no. 1, pp. 1-12, Jan. 2007.
[118] J. Le Roux, F. Weninger, and J. Hershey, "Sparse NMF - half-baked or well done?," Mitsubishi Electr. Res. Labs (MERL), Cambridge, MA, USA, Tech. Rep. TR2015-023, Mar. 2015.
[119] T. T. Kristjansson, J. R. Hershey, P. A. Olsen, S. J. Rennie, and R. A. Gopinath, "Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system," in Proc. 9th Int. Conf. Spoken Language Processing (ICSLP), Pittsburgh, PA, USA, 2006, Article ID 1775-Mon1WeS.7. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.116.4496
[120] T. Virtanen, "Speech recognition using factorial hidden Markov models for separation in the feature space," in Proc. 9th Int. Conf. Spoken Language Processing (ICSLP), Pittsburgh, PA, USA, 2006. https://www.researchgate.net/publication/221489983_Speech_recognition_using_factorial_hidden_Markov_models_for_separation_in_the_feature_space
[121] R. J. Weiss and D. P. W. Ellis, "Monaural speech separation using source-adapted models," in Proc. 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, 2007, pp. 114-117. https://www.researchgate.net/publication/4295684_Monaural_Speech_Separation_using_Source-Adapted_Models
[122] Z. Ghahramani and M. I. Jordan, "Factorial hidden Markov models," Mach. Learn., vol. 29, no. 2-3, pp. 245-273, Nov. 1997.
[123] Z. Chen, Y. Luo, and N. Mesgarani, "Deep attractor network for single-microphone speaker separation," in Proc. 2017 IEEE Int. Conf. Acoustics, Speech and Signal Processing, New Orleans, USA, 2017. https://arxiv.org/pdf/1611.08930v1
[124] D. Yu, X. Chang, and Y. M. Qian, "Recognizing multi-talker speech with permutation invariant training," in Proc. Interspeech, Stockholm, Sweden, 2017. https://www.researchgate.net/publication/315835420_Recognizing_Multi-talker_Speech_with_Permutation_Invariant_Training
[125] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. 27th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2014, pp. 2672-2680.
[126] Y. Shinohara, "Adversarial multi-task learning of deep neural networks for robust speech recognition," in Proc. Interspeech, San Francisco, USA, 2016, pp. 2369-2372. http://www.isca-speech.org/archive/Interspeech_2016/pdfs/0879.PDF
[127] D. Serdyuk, K. Audhkhasi, P. Brakel, B. Ramabhadran, S. Thomas, and Y. Bengio, "Invariant representations for noisy speech recognition," arXiv:1612.01928, 2016. https://www.researchgate.net/publication/311458959_Invariant_Representations_for_Noisy_Speech_Recognition
[128] S. N. Sun, B. B. Zhang, L. Xie, and Y. N. Zhang, "An unsupervised deep domain adaptation approach for robust speech recognition," Neurocomputing, to be published. http://www.sciencedirect.com/science/article/pii/S0925231217301492
[129] Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," arXiv:1409.7495, 2014. https://www.researchgate.net/publication/266204110_Unsupervised_Domain_Adaptation_by_Backpropagation
[130] R. Lippmann, E. Martin, and D. Paul, "Multi-style training for robust isolated-word speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP'87), Dallas, TX, USA, 1987, pp. 705-708. https://www.researchgate.net/publication/224737757_Multi-style_training_for_robust_isolated-word_speech_recognition
[131] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. 2017 IEEE Int. Conf. Acoustics, Speech and Signal Processing, New Orleans, USA, 2017.
[132] J. Li, M. L. Seltzer, X. Wang, R. Zhao, and Y. Gong, "Large-scale domain adaptation via teacher-student learning," in Proc. Interspeech, Stockholm, Sweden, 2017.
[133] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv:1503.02531, 2015. https://www.researchgate.net/publication/273387909_Distilling_the_Knowledge_in_a_Neural_Network
[134] K. Markov and T. Matsui, "Robust speech recognition using generalized distillation framework," in Proc. Interspeech, San Francisco, USA, 2016, pp. 2364-2368. https://www.researchgate.net/profile/Konstantin_Markov/publication/307889099_Robust_Speech_Recognition_Using_Generalized_Distillation_Framework/links/57e0c25608aece48e9e20398.pdf
[135] S. Watanabe, T. Hori, J. Le Roux, and J. R. Hershey, "Student-teacher network learning with enhanced features," in Proc. 2017 IEEE Int. Conf. Acoustics, Speech and Signal Processing, New Orleans, USA, 2017. http://www.merl.com/publications/docs/TR2017-011.pdf
[136] Z. Y. Lu, V. Sindhwani, and T. N. Sainath, "Learning compact recurrent neural networks," in Proc. 2016 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 2016, pp. 5960-5964. https://www.researchgate.net/publication/301876248_Learning_Compact_Recurrent_Neural_Networks
[137] R. Prabhavalkar, O. Alsharif, A. Bruguier, and I. McGraw, "On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition," in Proc. 2016 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 2016, pp. 5970-5974. http://adsabs.harvard.edu/abs/2016arXiv160308042P
[138] W. Chan, N. R. Ke, and I. Lane, "Transferring knowledge from a RNN to a DNN," arXiv:1504.01483, 2015. https://www.researchgate.net/publication/274730673_Transferring_Knowledge_from_a_RNN_to_a_DNN
[139] K. J. Geras, A. R. Mohamed, R. Caruana, G. Urban, S. J. Wang, O. Aslan, M. Philipose, M. Richardson, and C. Sutton, "Blending LSTMs into CNNs," arXiv:1511.06433, 2015. https://www.researchgate.net/publication/301548752_Blending_LSTMs_into_CNNs
[140] L. Lu, M. Guo, and S. Renals, "Knowledge distillation for small-footprint highway networks," in Proc. 2017 IEEE Int. Conf. Acoustics, Speech and Signal Processing, New Orleans, USA, 2017. https://www.researchgate.net/publication/305780312_Knowledge_Distillation_for_Small-footprint_Highway_Networks
[141] J. Cui, B. Kingsbury, B. Ramabhadran, G. Saon, T. Sercu, K. Audhkhasi, A. Sethy, M. Nussbaum-Thom, and A. Rosenberg, "Knowledge distillation across ensembles of multilingual models for low-resource languages," in Proc. 2017 IEEE Int. Conf. Acoustics, Speech and Signal Processing, New Orleans, USA, 2017. doi: 10.1109/ICASSP.2017.7953073
[142] J. Li, R. Zhao, J. T. Huang, and Y. F. Gong, "Learning small-size DNN with output-distribution-based criteria," in Proc. Interspeech, Singapore, 2014, pp. 1910-1914. https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenTerm1201415/zhao.pdf
[143] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "Achieving human parity in conversational speech recognition," arXiv:1610.05256, 2016. https://www.researchgate.net/publication/309207213_Achieving_Human_Parity_in_Conversational_Speech_Recognition
[144] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. D. Cui, B. Ramabhadran, M. Picheny, L. L. Lim, B. Roomi, and P. Hall, "English conversational telephone speech recognition by humans and machines," arXiv:1703.02136, 2017. https://www.researchgate.net/publication/314283069_English_Conversational_Telephone_Speech_Recognition_by_Humans_and_Machines
[145] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Proc. 2011 Deep Learning and Unsupervised Feature Learning NIPS Workshop, Granada, Spain, 2011. https://www.researchgate.net/publication/267429210_Improving_the_speed_of_neural_networks_on_CPUs
[146] R. Alvarez, R. Prabhavalkar, and A. Bakhtin, "On the efficient representation and execution of deep acoustic models," arXiv:1607.04683v1, 2016. https://www.researchgate.net/publication/307889746_On_the_Efficient_Representation_and_Execution_of_Deep_Acoustic_Models
[147] R. Takeda, K. Nakadai, and K. Komatani, "Acoustic model training based on node-wise weight boundary model for fast and small-footprint deep neural networks," Comput. Speech Lang., to be published. http://www.sciencedirect.com/science/article/pii/S0885230816300699
[148] Y. Q. Wang, J. Li, and Y. F. Gong, "Small-footprint high-performance deep neural network-based speech recognition using split-VQ," in Proc. 2015 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Brisbane, QLD, Australia, 2015, pp. 4984-4988. https://www.researchgate.net/publication/308821136_Small-footprint_high-performance_deep_neural_network-based_speech_recognition_using_split-VQ
[149] V. Vanhoucke, M. Devin, and G. Heigold, "Multiframe deep neural networks for acoustic modeling," in Proc. 2013 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 7582-7585. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.308.2471&rep=rep1&type=pdf
[150] G. Pundak and T. N. Sainath, "Lower frame rate neural network acoustic models," in Proc. Interspeech, San Francisco, USA, 2016, pp. 22-26. https://www.researchgate.net/publication/307889457_Lower_Frame_Rate_Neural_Network_Acoustic_Models