The Conversion of Speech to Text (CoST) is crucial for developing automated systems that understand and process voice commands. Research has increasingly targeted this task for Turkish voice commands, as Turkish is a strategically important language in the international arena. However, researchers face numerous challenges, including Turkish's agglutinative (suffix-based) structure, its phonological features and unique letters, dialect and accent differences, word stress, word-initial vowel effects, background noise, and gender-based voice variation. To address these challenges, this study aims to convert Turkish audio clips, a resource that has received limited attention in the literature, into text with high accuracy using different Machine Learning (ML) models, in particular Convolutional Neural Networks (CNNs) and Convolutional Recurrent Neural Networks (CRNNs). For this purpose, experiments were conducted on a dataset of 26,485 Turkish audio clips, performance was evaluated with several metrics, and hyperparameters were optimized to improve model performance. An F1-score above 97% was achieved, with the CRNN approach yielding the best results. In conclusion, this study provides valuable insights into the strengths and limitations of various ML models applied to CoST. Beyond potentially contributing to a wide range of applications, such as supporting hard-of-hearing individuals, facilitating note-taking, enabling automatic captioning, and improving voice-command recognition systems, it is among the first studies in the literature on CoST in Turkish.
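For readers unfamiliar with the CRNN family of models referenced above, the following is a minimal sketch of a convolutional-recurrent classifier for spoken commands, written in PyTorch. It assumes mel-spectrogram inputs and an arbitrary command vocabulary size; the paper's actual layer configuration, input features, class count, and hyperparameters are not specified here, so every concrete value below is illustrative only.

```python
# Illustrative CRNN for spoken-command classification.
# Hypothetical architecture: the study's exact design is not reproduced here.
import torch
import torch.nn as nn

class CRNNCommandClassifier(nn.Module):
    """Convolutional front-end over mel-spectrograms, followed by a GRU and a
    linear classifier over an assumed command vocabulary."""

    def __init__(self, n_mels: int = 64, n_classes: int = 35, hidden: int = 128):
        super().__init__()
        # 2-D convolutions extract local time-frequency features.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),  # halves both frequency and time axes
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # The bidirectional GRU models the remaining temporal structure.
        self.rnn = nn.GRU(input_size=64 * (n_mels // 4), hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, time_frames)
        f = self.conv(x)                      # (batch, 64, n_mels/4, time/4)
        f = f.permute(0, 3, 1, 2).flatten(2)  # (batch, time/4, 64 * n_mels/4)
        out, _ = self.rnn(f)
        return self.fc(out[:, -1])            # logits per command class

# Quick shape check with a dummy clip of 64 mel bins x 100 frames.
if __name__ == "__main__":
    model = CRNNCommandClassifier()
    logits = model(torch.randn(8, 1, 64, 100))
    print(logits.shape)  # torch.Size([8, 35])
```

The design choice the sketch illustrates is the one the abstract credits for the best results: convolutional layers summarize local spectro-temporal patterns, while the recurrent layer captures the longer-range temporal dependencies of a spoken command before classification.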
Keywords: Natural Language Processing, Convolutional Neural Networks, Convolutional Recurrent Neural Networks, Deep Learning, Speech Recognition, Speech to Text
| Primary Language | English |
| --- | --- |
| Subjects | Natural Language Processing, Artificial Intelligence (Other) |
| Journal Section | Research Article |
| Early Pub Date | June 27, 2024 |
| Publication Date | June 29, 2024 |
| Submission Date | February 10, 2024 |
| Acceptance Date | March 20, 2024 |
| Published in Issue | Year 2024, Volume 13, Issue 2 |