Details

Title

Speech emotion recognition using wavelet packet reconstruction with attention-based deep recurrent neutral networks

Journal title

Bulletin of the Polish Academy of Sciences Technical Sciences

Yearbook

2021

Volume

69

Issue

No. 1

Affiliation

Meng, Hao : Key laboratory of Intelligent Technology and Application of Marine Equipment (Harbin Engineering University), Ministry of Education, Harbin, 150001, China ; Yan, Tianhao : Key laboratory of Intelligent Technology and Application of Marine Equipment (Harbin Engineering University), Ministry of Education, Harbin, 150001, China ; Wei, Hongwei : Key laboratory of Intelligent Technology and Application of Marine Equipment (Harbin Engineering University), Ministry of Education, Harbin, 150001, China ; Ji, Xun : College of Marine Electrical Engineering, Dalian Maritime University, Dalian, 116026, China

Authors

Meng, Hao ; Yan, Tianhao ; Wei, Hongwei ; Ji, Xun

Keywords

speech emotion recognition ; voice activity detection ; wavelet packet reconstruction ; feature extraction ; LSTM network ; attention mechanism

Divisions of PAS

Technical Sciences

Coverage

e136300

Bibliography

  1.  M. Gupta, et al., “Emotion recognition from speech using wavelet packet transform and prosodic features”, J. Intell. Fuzzy Syst. 35, 1541–1553 (2018).
  2.  M. El Ayadi, et al., “Survey on speech emotion recognition: Features, classification schemes, and databases”, Pattern Recognit. 44, 572–587 (2011).
  3.  P. Tzirakis, et al., “End-to-end speech emotion recognition using deep neural networks”, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018, pp. 5089‒5093, doi: 10.1109/ICASSP.2018.8462677.
  4.  J.M. Liu, et al., “Learning Salient Features for Speech Emotion Recognition Using CNN”, 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia), Beijing, China, 2018, pp. 1‒5, doi: 10.1109/ACIIAsia.2018.8470393.
  5.  J. Kim, et al., “Learning spectro-temporal features with 3D CNNs for speech emotion recognition”, 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), San Antonio, USA, 2017, pp. 383‒388, doi: 10.1109/ACII.2017.8273628.
  6.  M.Y. Chen, X.J. He, et al., “3-D Convolutional Recurrent Neural Networks with Attention Model for Speech Emotion Recognition”, IEEE Signal Process. Lett. 25(10), 1440‒1444 (2018), doi: 10.1109/LSP.2018.2860246.
  7.  V.N. Degaonkar and S.D. Apte, “Emotion modeling from speech signal based on wavelet packet transform”, Int. J. Speech Technol. 16, 1‒5 (2013).
  8.  T. Feng and S. Yang, “Speech Emotion Recognition Based on LSTM and Mel Scale Wavelet Packet Decomposition”, Proceedings of the 2018 International Conference on Algorithms, Computing and Artificial Intelligence (ACAI 2018), New York, USA, 2018, art. 38.
  9.  P. Yenigalla, A. Kumar, et al., “Speech Emotion Recognition Using Spectrogram & Phoneme Embedding”, Proc. Interspeech 2018, 2018, pp. 3688‒3692, doi: 10.21437/Interspeech.2018-1811.
  10.  J. Kim, K.P. Truong, G. Englebienne, and V. Evers, “Learning spectro-temporal features with 3D CNNs for speech emotion recognition”, 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), San Antonio, USA, 2017, pp. 383‒388, doi: 10.1109/ACII.2017.8273628.
  11.  S. Jing, X. Mao, and L. Chen, “Prominence features: Effective emotional features for speech emotion recognition”, Digital Signal Process. 72, 216‒231 (2018).
  12.  L. Chen, X. Mao, P. Wei, and A. Compare, “Speech emotional features extraction based on electroglottograph”, Neural Comput. 25(12), 3294–3317 (2013).
  13.  J. Hook, et al., “Automatic speech based emotion recognition using paralinguistics features”, Bull. Pol. Ac.: Tech. 67(3), 479‒488 (2019).
  14.  A. Mencattini, E. Martinelli, G. Costantini, M. Todisco, B. Basile, M. Bozzali, and C. Di Natale, “Speech emotion recognition using amplitude modulation parameters and a combined feature selection procedure”, Knowl.-Based Syst. 63, 68–81 (2014).
  15.  H. Mori, T. Satake, M. Nakamura, and H. Kasuya, “Constructing a spoken dialogue corpus for studying paralinguistic information in expressive conversation and analyzing its statistical/acoustic characteristics”, Speech Commun. 53(1), 36–50 (2011).
  16.  B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, and S. Narayanan, “Paralinguistics in speech and language – state-of-the-art and the challenge”, Comput. Speech Lang. 27(1), 4–39 (2013).
  17.  S. Mariooryad and C. Busso, “Compensating for speaker or lexical variabilities in speech for emotion recognition”, Speech Commun. 57, 1–12 (2014).
  18.  G. Trigeorgis, et al., “Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network”, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 2016, pp. 5200‒5204, doi: 10.1109/ICASSP.2016.7472669.
  19.  Y. Xie, et al., “Attention-based dense LSTM for speech emotion recognition”, IEICE Trans. Inf. Syst. E102.D, 1426‒1429 (2019).
  20.  F. Tao and G. Liu, “Advanced LSTM: A Study about Better Time Dependency Modeling in Emotion Recognition”, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018, pp. 2906‒2910, doi: 10.1109/ICASSP.2018.8461750.
  21.  Y.M. Huang and W. Ao, “Novel Sub-band Spectral Centroid Weighted Wavelet Packet Features with Importance-Weighted Support Vector Machines for Robust Speech Emotion Recognition”, Wireless Personal Commun. 95, 2223–2238 (2017).
  22.  Firoz Shah A. and Babu Anto P., “Wavelet Packets for Speech Emotion Recognition”, 2017 Third International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB), Chennai, 2017, pp. 479‒481, doi: 10.1109/AEEICB.2017.7972358.
  23.  K. Wang, N. An, and L. Li, “Speech Emotion Recognition Based on Wavelet Packet Coefficient Model”, The 9th International Symposium on Chinese Spoken Language Processing, Singapore, 2014, pp. 478‒482, doi: 10.1109/ISCSLP.2014.6936710.
  24.  S. Sekkate, et al., “An Investigation of a Feature-Level Fusion for Noisy Speech Emotion Recognition”, Computers 8, 91 (2019).
  25.  Varsha N. Degaonkar and Shaila D. Apte, “Emotion Modeling from Speech Signal based on Wavelet Packet Transform”, Int. J. Speech Technol. 16, 1–5 (2013).
  26.  F. Eyben, et al., “Opensmile: the munich versatile and fast open-source audio feature extractor”, MM ’10: Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 1459‒1462.
  27.  Ch.-N. Anagnostopoulos, T. Iliou, and I. Giannoukos, “Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011”, Artif. Intell. Rev. 43(2), 155–177 (2015).
  28.  H. Meng, T. Yan, F. Yuan, and H. Wei, “Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network”, IEEE Access 7, 125868‒125881 (2019).
  29.  G. Keren and B. Schuller, “Convolutional RNN: An enhanced model for extracting features from sequential data”, International Joint Conference on Neural Networks, 2016, pp. 3412‒3419.
  30.  C.W. Huang and S.S. Narayanan, “Deep convolutional recurrent neural network with attention mechanism for robust speech emotion recognition”, IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, 2017, pp. 583‒588, doi: 10.1109/ICME.2017.8019296.
  31.  S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic Speech Emotion Recognition using Recurrent Neural Networks with Local Attention”, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, 2017, pp. 2227‒2231, doi: 10.1109/ICASSP.2017.7952552.
  32.  A. Vaswani, et al., “Attention Is All You Need”, 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, USA, 2017.
  33.  X.J. Wang, et al., “Dynamic Attention Deep Model for Article Recommendation by Learning Human Editors’ Demonstration”, Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Canada, 2017.
  34.  C. Busso, et al., “IEMOCAP: interactive emotional dyadic motion capture database,” Language Resources & Evaluation 42(4), 335 (2008).
  35.  F. Burkhardt, A. Paeschke, M. Rolfes, W.F. Sendlmeier, and B. Weiss, “A database of German emotional speech”, INTERSPEECH 2005 – Eurospeech, Lisbon, Portugal, 2005, pp. 1517‒1520.
  36.  D. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization”, International Conference on Learning Representations (ICLR), San Diego, USA, 2015.
  37.  F. Vuckovic, G. Lauc, and Y. Aulchenko, “Normalization and batch correction methods for high-throughput glycomics”, Joint Meeting of the Society-For-Glycobiology 2016, pp. 1160‒1161.

Date

19.02.2021

Type

Article

Identifier

DOI: 10.24425/bpasts.2020.136300

Source

Bulletin of the Polish Academy of Sciences: Technical Sciences; 2021; 69; No. 1; e136300