Deep Scalogram Representations for Acoustic Scene Classification

Zhao Ren; Kun Qian; Zixing Zhang; Vedhas Pandit; Alice Baird; Björn Schuller

doi:10.1109/JAS.2018.7511066

Volume 5 Issue 3

May 2018

IEEE/CAA Journal of Automatica Sinica


Zhao Ren, Kun Qian, Zixing Zhang, Vedhas Pandit, Alice Baird and Björn Schuller, "Deep Scalogram Representations for Acoustic Scene Classification," IEEE/CAA J. Autom. Sinica, vol. 5, no. 3, pp. 662-669, Mar. 2018. doi: 10.1109/JAS.2018.7511066


1. ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
2. Machine Intelligence and Signal Processing Group, Technische Universität München, Germany, and also with the ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
3. Group on Language, Audio and Music (GLAM), Imperial College London, UK
4. Group on Language, Audio and Music (GLAM), Imperial College London, UK, and also with the ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany

Funds:

the German National BMBF IKT2020-Grant 16SV7213 (EmotAsS)
the European Union's Horizon 2020 Research and Innovation Programme, grant 688835 (DE-ENIGMA)
the China Scholarship Council (CSC)

Author Bio:
Zhao Ren (S'17) received the master's degree in computer science and technology from Northwestern Polytechnical University (NWPU), China, in 2017. Currently, she is a Research Assistant working on the Ph.D. degree at the ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany, where she is involved with the German national BMBF IKT2020-Grant project EmotAsS, for emotion analysis based on speech. Her research interests mainly lie in transfer learning, unsupervised learning, and deep learning for applications in health care and wellbeing. (e-mail: zhao.ren@informatik.uni-augsburg.de)

Kun Qian (S'14) received the master's degree in signal and information processing from the Nanjing University of Science and Technology (NUST), China, in 2014. Currently, he is working on the Ph.D. degree in electrical engineering and information technology at Technische Universität München (TUM), Munich, Germany. He was sponsored by scholarships to conduct cooperative research at the Nanyang Technological University (NTU), Singapore, the Tokyo Institute of Technology (Tokyo Tech), Japan, and Carnegie Mellon University (CMU), USA. His research interests include signal processing, machine learning, biomedical engineering, and deep learning in high performance computing systems. (e-mail: andykun.qian@tum.de)

Zixing Zhang (M'15) received the master's degree in physical electronics from Beijing University of Posts and Telecommunications, China, in 2010, and the Ph.D. degree in engineering from the Machine Intelligence and Signal Processing group at Technische Universität München (TUM), Munich, Germany, in 2015. He is currently a Research Associate at Imperial College London, UK. He has authored more than fifty publications in peer-reviewed journals and conference proceedings. His research interests mainly lie in semi-supervised learning, active learning, and deep learning for applications in affective computing. (e-mail: zixing.zhang@imperial.ac.uk)

Vedhas Pandit (S'11) received the master's degree in electronic and mixed-signal circuit design (EECE) from Arizona State University (ASU), USA, in 2010, with a thesis on mathematical modelling of a-Si:H SOI transistors. After working for Intel as a Graphics Hardware Engineer, he worked as a Researcher at the Indian Institute of Technology Bombay (IITB), developing tools for automated music information retrieval. Since February 2015, he has been working on the Ph.D. degree at the University of Passau, Germany, and the University of Augsburg, Germany. His research interests include music information retrieval, speech and virtual instrument synthesis, deep learning strategies in machine learning, and biomedical signal processing. (e-mail: vedhas.pandit@informatik.uni-augsburg.de)

Alice Baird is a Research Assistant at the ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany, where she is involved with the Horizon 2020 project DE-ENIGMA, for analysis of vocal and linguistic cues. Alice has recently been awarded a ZD.B Ph.D. Fellowship (2018-2021), in which she will research speech monitoring and soundscape synthesis. Alice has an (S'16) M.FA in Sound Arts from Columbia University, Computer Music Center, and a (S'13) B.A. in Music Technology from London Metropolitan University. Alice works across an array of disciplines, predominantly in the realm of paralinguistic speech and intelligent audio analysis. Her research focuses on applications of computing for health and wellbeing, with consideration of methodologies for 'in the wild' data collection. (e-mail: alice.baird@informatik.uni-augsburg.de)

Björn Schuller (M'06-SM'15-F'18) received his diploma in 1999, his doctoral degree for his study on automatic speech and emotion recognition in 2006, and his habilitation and adjunct teaching professorship in the subject area of signal processing and machine intelligence in 2012, all in electrical engineering and information technology from Technische Universität München (TUM), Germany. He is a tenured Full Professor heading the Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany, and a Reader (Associate Professor) in Machine Learning heading GLAM, the Group on Language, Audio and Music, Department of Computing at Imperial College London, UK. Dr. Schuller is an elected member of the IEEE Speech and Language Processing Technical Committee, Editor in Chief of the IEEE Transactions on Affective Computing, President-emeritus of the AAAC, Fellow of the IEEE, and Senior Member of the ACM. He has (co-)authored 5 books and more than 700 publications in peer-reviewed books, journals, and conference proceedings, leading to more than 17 000 citations (h-index 64). (e-mail: schuller@ieee.org)
Corresponding author: Zhao Ren, e-mail: zhao.ren@informatik.uni-augsburg.de
Received Date: 2018-01-29
Accepted Date: 2018-02-26

Abstract

Spectrogram representations of acoustic scenes have achieved competitive performance for acoustic scene classification. Yet, the spectrogram alone does not take into account a substantial amount of time-frequency information. In this study, we present an approach for exploring the benefits of deep scalogram representations, extracted in segments from an audio stream. The approach first transforms the segmented acoustic scenes into bump and morse scalograms, as well as spectrograms; secondly, the spectrograms or scalograms are sent into pre-trained convolutional neural networks; thirdly, the features extracted from a subsequent fully connected layer are fed into (bidirectional) gated recurrent neural networks, which are followed by a single highway layer and a softmax layer; finally, predictions from these three systems are fused by a margin sampling value strategy. We then evaluate the proposed approach using the acoustic scene classification data set of the 2017 IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). On the evaluation set, an accuracy of 64.0% from bidirectional gated recurrent neural networks is obtained when fusing the spectrogram and the bump scalogram, which is an improvement on the 61.0% baseline result provided by the DCASE 2017 organisers. This result shows that extracted bump scalograms are capable of improving the classification accuracy when fused with a spectrogram-based system.
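
The abstract names a margin sampling value (MSV) fusion strategy without spelling it out. The following is a minimal sketch, assuming that the MSV of a prediction is the gap between its two largest class posteriors and that the fusion takes, for each sample, the prediction of whichever system has the largest gap. The function names and the example probabilities are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def margin_sampling_value(probs: Sequence[float]) -> float:
    """Gap between the highest and second-highest class posterior (assumed MSV)."""
    if len(probs) < 2:
        raise ValueError("need at least two classes")
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def fuse_by_msv(model_probs: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Return (predicted_class, index_of_chosen_model) for one sample.

    model_probs holds one posterior vector per system, e.g. the
    spectrogram-based and bump-scalogram-based softmax outputs.
    """
    margins = [margin_sampling_value(p) for p in model_probs]
    # Pick the most confident system; ties go to the first one listed.
    chosen = max(range(len(model_probs)), key=margins.__getitem__)
    probs = model_probs[chosen]
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, chosen


if __name__ == "__main__":
    spectrogram_probs = [0.40, 0.35, 0.25]  # hypothetical: small margin
    bump_probs = [0.10, 0.80, 0.10]         # hypothetical: large margin
    print(fuse_by_msv([spectrogram_probs, bump_probs]))
```

In this toy case the second system has the larger margin, so its top class is taken as the fused prediction. Under this assumption the fusion needs no trained weights, only the per-system posteriors.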

