
Performance Evaluation of Deep Neural Networks Applied to Speech Recognition: RNN, LSTM and GRU

Open Access | Aug 2019

Abstract

Deep Neural Networks (DNN) are neural networks with many hidden layers. DNNs are becoming popular in automatic speech recognition tasks, which combine a good acoustic model with a language model. Standard feedforward neural networks cannot handle speech data well since they have no way to feed information from a later layer back to an earlier layer. Thus, Recurrent Neural Networks (RNNs) have been introduced to take temporal dependencies into account. However, the shortcoming of RNNs is that they cannot handle long-term dependencies due to the vanishing/exploding gradient problem. Therefore, Long Short-Term Memory (LSTM) networks were introduced, which are a special case of RNNs that takes long-term dependencies in speech, in addition to short-term dependencies, into account. Similarly, GRU (Gated Recurrent Unit) networks are an improvement of LSTM networks that also takes long-term dependencies into consideration. Thus, in this paper, we evaluate RNN, LSTM, and GRU to compare their performance on a reduced TED-LIUM speech data set. The results show that LSTM achieves the best word error rates; however, GRU optimization is faster while achieving word error rates close to those of LSTM.
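To make the comparison concrete, here is a minimal sketch of the three recurrent layer types using PyTorch's built-in modules. The feature dimension (40, e.g. filterbank frames) and hidden size (128) are illustrative placeholders, not the paper's configuration. The printed parameter counts hint at why GRU training tends to be cheaper than LSTM: a GRU cell has three gating/candidate weight sets where an LSTM cell has four.

```python
import torch
import torch.nn as nn

# Toy batch of "speech" features: (batch, time, feature_dim),
# e.g. 40-dimensional filterbank frames per time step.
x = torch.randn(8, 100, 40)

# The three recurrent layers compared in the paper, as single-layer
# PyTorch modules with an assumed hidden size of 128.
rnn = nn.RNN(input_size=40, hidden_size=128, batch_first=True)
lstm = nn.LSTM(input_size=40, hidden_size=128, batch_first=True)
gru = nn.GRU(input_size=40, hidden_size=128, batch_first=True)

for name, layer in [("RNN", rnn), ("LSTM", lstm), ("GRU", gru)]:
    out, _ = layer(x)  # out: (batch, time, hidden_size)
    n_params = sum(p.numel() for p in layer.parameters())
    print(f"{name}: output {tuple(out.shape)}, {n_params} parameters")
```

Running this shows the LSTM carrying roughly a third more parameters than the GRU at the same hidden size, consistent with the abstract's observation that GRU optimization is faster while its accuracy stays close to LSTM.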

Language: English
Page range: 235 - 245
Submitted on: Sep 29, 2018
Accepted on: Mar 10, 2019
Published on: Aug 30, 2019
Published by: SAN University
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2019 Apeksha Shewalkar, Deepika Nyavanandi, Simone A. Ludwig, published by SAN University
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.