Research Article

Turkish Speech Recognition Techniques and Applications of Recurrent Units (LSTM and GRU)

Year 2021, Volume: 34 Issue: 4, 1035 - 1049, 01.12.2021
https://doi.org/10.35378/gujs.816499

Abstract

A typical solution to Automatic Speech Recognition (ASR) problems consists of feature extraction, feature classification, acoustic modeling, and language modeling steps. In the classification and modeling steps, deep learning methods have become popular and give better recognition results than conventional methods. In this study, an application for solving the ASR problem in the Turkish language has been developed. The data sets and studies related to the Turkish ASR problem are examined, as are the language models used in ASR for agglutinative languages such as Turkish, Finnish, and Hungarian. A subword-based model is chosen to avoid an excessively large vocabulary without degrading recognition performance. Recognition performance is increased by using the deep learning methods Long Short-Term Memory (LSTM) neural networks and Gated Recurrent Units (GRU) in the classification and acoustic modeling steps. The recognition performance of the LSTM- and GRU-based systems is compared with that of previous studies using traditional methods and Deep Neural Networks. The results show that LSTM- and GRU-based speech recognizers perform better than recognizers built with the previous methods. Final Word Error Rate (WER) values of 10.65% and 11.25% were obtained for LSTM and GRU, respectively. GRU-based systems thus perform similarly to LSTM-based systems, but their training time was observed to be shorter: computation times were 73,518 and 61,020 seconds for LSTM and GRU, respectively. The study gives detailed information about the applicability of the latest methods to Turkish ASR research and applications.
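
For readers unfamiliar with the metric, the Word Error Rate reported above is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the recognizer output and the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, which is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a WER of 10.65% corresponds to roughly one word error for every nine to ten reference words.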

There are 42 citations in total.

Details

Primary Language English
Subjects Engineering
Journal Section Electrical & Electronics Engineering
Authors

Burak Tombaloğlu 0000-0003-3994-0422

Hamit Erdem 0000-0003-1704-1581

Publication Date December 1, 2021
Published in Issue Year 2021 Volume: 34 Issue: 4

Cite

APA Tombaloğlu, B., & Erdem, H. (2021). Turkish Speech Recognition Techniques and Applications of Recurrent Units (LSTM and GRU). Gazi University Journal of Science, 34(4), 1035-1049. https://doi.org/10.35378/gujs.816499
AMA Tombaloğlu B, Erdem H. Turkish Speech Recognition Techniques and Applications of Recurrent Units (LSTM and GRU). Gazi University Journal of Science. December 2021;34(4):1035-1049. doi:10.35378/gujs.816499
Chicago Tombaloğlu, Burak, and Hamit Erdem. “Turkish Speech Recognition Techniques and Applications of Recurrent Units (LSTM and GRU)”. Gazi University Journal of Science 34, no. 4 (December 2021): 1035-49. https://doi.org/10.35378/gujs.816499.
EndNote Tombaloğlu B, Erdem H (December 1, 2021) Turkish Speech Recognition Techniques and Applications of Recurrent Units (LSTM and GRU). Gazi University Journal of Science 34 4 1035–1049.
IEEE B. Tombaloğlu and H. Erdem, “Turkish Speech Recognition Techniques and Applications of Recurrent Units (LSTM and GRU)”, Gazi University Journal of Science, vol. 34, no. 4, pp. 1035–1049, 2021, doi: 10.35378/gujs.816499.
ISNAD Tombaloğlu, Burak - Erdem, Hamit. “Turkish Speech Recognition Techniques and Applications of Recurrent Units (LSTM and GRU)”. Gazi University Journal of Science 34/4 (December 2021), 1035-1049. https://doi.org/10.35378/gujs.816499.
JAMA Tombaloğlu B, Erdem H. Turkish Speech Recognition Techniques and Applications of Recurrent Units (LSTM and GRU). Gazi University Journal of Science. 2021;34:1035–1049.
MLA Tombaloğlu, Burak and Hamit Erdem. “Turkish Speech Recognition Techniques and Applications of Recurrent Units (LSTM and GRU)”. Gazi University Journal of Science, vol. 34, no. 4, 2021, pp. 1035-49, doi:10.35378/gujs.816499.
Vancouver Tombaloğlu B, Erdem H. Turkish Speech Recognition Techniques and Applications of Recurrent Units (LSTM and GRU). Gazi University Journal of Science. 2021;34(4):1035-49.
