Turkish Speech Recognition Based On Deep Neural Networks
Year 2018,
Volume: 22 Issue: Special, 319 - 329, 05.10.2018
Ussen Abre Kımanuka, Osman Buyuk
Abstract
In this paper, we develop a Turkish speech recognition (SR) system based on deep neural networks (DNNs) and compare it with the previous state-of-the-art approach, the traditional Gaussian mixture model-hidden Markov model (GMM-HMM), using the same Turkish speech dataset and the same large-vocabulary Turkish corpus. Most SR systems deployed today, worldwide and particularly in Turkey, use hidden Markov models (HMMs) to handle the temporal variability of speech. Gaussian mixture models are used to estimate how well each state of each HMM fits a short frame of coefficients representing the acoustic input. A feed-forward deep neural network is another way to estimate this fit: the network takes several frames of coefficients as input and outputs posterior probabilities over HMM states. DNNs have been shown to outperform the traditional GMM-HMM in other languages such as English and German. The agglutinative nature of Turkish and the lack of large amounts of speech data complicate the design of a high-performing SR system. We expect deep neural networks to improve performance, but not to match the results reported for English, because far less speech data is available. We present various architectural and training techniques for the Turkish DNN-based models. The models are tested on a Turkish database collected from mobile devices. In the experiments, we observe that the Turkish DNN-HMM system reduces the word error rate by approximately 2.5% compared with the traditional GMM-HMM system.
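The DNN acoustic model described in the abstract (several frames of coefficients in, posterior probabilities over HMM states out) can be illustrated with a minimal sketch. It is not the system evaluated in the paper: the feature dimension, context width, layer sizes, number of HMM states, ReLU activation, and random untrained weights are all assumptions chosen only to make the data flow concrete.

```python
import math
import random

def splice_frames(frames, context):
    """Concatenate each frame with `context` neighbours on each side.

    Frames beyond the utterance edges are replaced by the first/last frame.
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp to a valid frame index
            window.extend(frames[idx])
        spliced.append(window)
    return spliced

def dense(x, weights, biases):
    """Affine layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    """Numerically stable softmax; the outputs sum to 1."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(rng, n_in, n_out):
    """Random, untrained parameters (illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def state_posteriors(spliced_frame, layers):
    """Feed-forward pass: hidden ReLU layers, then softmax over HMM states."""
    h = spliced_frame
    for weights, biases in layers[:-1]:
        h = relu(dense(h, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(h, weights, biases))

# Hypothetical sizes, not taken from the paper.
FEATURE_DIM = 13   # coefficients per frame
CONTEXT = 5        # frames of context on each side
HIDDEN = 16        # units per hidden layer
NUM_STATES = 8     # HMM states modelled by the output layer

rng = random.Random(0)
frames = [[rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)] for _ in range(20)]
inputs = splice_frames(frames, CONTEXT)

input_dim = FEATURE_DIM * (2 * CONTEXT + 1)
layers = [
    init_layer(rng, input_dim, HIDDEN),
    init_layer(rng, HIDDEN, HIDDEN),
    init_layer(rng, HIDDEN, NUM_STATES),
]
posteriors = [state_posteriors(x, layers) for x in inputs]
```

Each entry of `posteriors` is a probability distribution over the assumed HMM states for one frame; in a hybrid DNN-HMM system these per-frame outputs take the place of the GMM fit scores during decoding.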