Research Article


Visual Speech Recognition Using CNN and LSTM with Turkish Daily Words and Phrases

Year 2024, Volume: 8 Issue: 2, 69 - 75

Abstract

Lip reading can be described as examining a speaker’s face to evaluate speech patterns, movements, gestures, and expressions. Giving computers the ability to lip-read is a growing research area in deep learning, with open problems remaining in classification and pattern recognition. In recent years, various methods have been developed and applied to classify speech and convert it to text in different languages. Moreover, most methods combine multi-modal data, i.e., speech and images. This study aims to provide a new Turkish lip-reading dataset consisting only of images, together with a high-accuracy classification method for Turkish daily words. The data were collected from the YouTube platform. Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) models were trained on this challenging data to classify daily words and phrases. Across numerous experiments, the CNN model performed better. Using only images rather than multi-modal data avoids exhausting memory and reduces computation time. Furthermore, we provide a multiclass Turkish dataset, since the variety available in the literature is limited.
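The reference list below cites OpenCV [21] and dlib [22], which are commonly used to isolate the mouth region from video frames before classification. The following is a minimal preprocessing sketch along those lines, assuming dlib's standard 68-point facial landmark model (points 48–67 cover the mouth); the crop size, margin, and predictor file path are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: crop the mouth region from a video frame using OpenCV and
# dlib's 68-point landmarks. Not the authors' exact pipeline.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# The landmark model file is distributed separately at http://dlib.net/;
# this path is an assumption for illustration.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_crop(frame, size=(64, 64), margin=10):
    """Return a grayscale mouth crop, or None if no face is detected."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Landmarks 48-67 are the mouth in the 68-point scheme.
    pts = [(shape.part(i).x, shape.part(i).y) for i in range(48, 68)]
    xs, ys = zip(*pts)
    x0, x1 = max(min(xs) - margin, 0), max(xs) + margin
    y0, y1 = max(min(ys) - margin, 0), max(ys) + margin
    return cv2.resize(gray[y0:y1, x0:x1], size)
```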
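To make the comparison in the abstract concrete, here is a minimal Keras sketch of the two kinds of classifiers being compared: a frame-level CNN and a sequence-level LSTM. The class count, input resolution, and sequence length are assumptions for illustration; the paper's actual architectures and hyperparameters are not given in this abstract.

```python
# Hedged sketch of the two model families compared in the paper:
# a 2D CNN over mouth-region frames and an LSTM over per-frame vectors.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 10          # assumed number of daily words/phrases
FRAME_H, FRAME_W = 64, 64  # assumed mouth-crop resolution (grayscale)
SEQ_LEN = 20              # assumed frames per utterance clip

def build_cnn():
    """Frame-level CNN classifier: one mouth image -> word class."""
    return models.Sequential([
        layers.Input(shape=(FRAME_H, FRAME_W, 1)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def build_lstm():
    """Sequence classifier: flattened frame vectors -> word class."""
    return models.Sequential([
        layers.Input(shape=(SEQ_LEN, FRAME_H * FRAME_W)),
        layers.LSTM(128),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

if __name__ == "__main__":
    for name, model in [("CNN", build_cnn()), ("LSTM", build_lstm())]:
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.summary()
```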

Supporting Institution

Aselsan-Bites

References

  • [1] C. G. Fisher, “Confusions among visually perceived consonants,” Journal of Speech, Language, and Hearing Research, vol. 11, no. 4, pp. 796–804, Dec. 1968.
  • [2] R. D. Easton and M. Basala, “Perceptual dominance during lipreading,” Perception & Psychophysics, vol. 32, no. 6, pp. 562–570, Nov. 1982.
  • [3] A. Cecilia Tejedor, Leer en los labios. Manual práctico para entrenamiento de la comprensión labiolectora. Madrid: CEPE, 2000.
  • [4] K. Shrestha, “Lip Reading using Neural Network and Deep Learning,” 1802, n.d.
  • [5] T. Ozcan and A. Basturk, “Lip Reading Using Convolutional Neural Networks with and without Pre-Trained Models,” Balkan Journal of Electrical and Computer Engineering, vol. 7, no. 2, pp. 195–201, Apr. 2019.
  • [6] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6447–6456.
  • [7] A. Chitu and L. Rothkrantz, “Visual Speech Recognition Automatic System for Lip Reading of Dutch,” Journal on Information Technologies and Control, vol. 7, no. 3, pp. 2–9, 2009.
  • [8] K. Saenko, K. Livescu, M. Siracusa, K. Wilson, J. Glass, and T. Darrell, “Visual Speech Recognition with Loosely Synchronized Feature Streams,” in Proceedings of the 10th International Conference on Computer Vision, 2005, pp. 1424–1431.
  • [9] K. Iwano, T. Yoshinaga, S. Tamura, and S. Furui, “Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2007, pp. 1–9, 2007.
  • [10] S. Fenghour, D. Chen, K. Guo, B. Li, and P. Xiao, “Deep learning-based automated lip-reading: A survey,” IEEE Access, vol. 9, pp. 121184–121205, 2021.
  • [11] M. Faisal and S. Manzoor, “Deep Learning for Lip Reading using Audio-Visual Information for Urdu Language,” CoRR, 2018. DOI: https://doi.org/10.48550/arXiv.1802.05521
  • [12] L. Pandey and A. S. Arif, “LipType: A Silent Speech Recognizer Augmented with an Independent Repair Model,” in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI ’21), Association for Computing Machinery, 2021, Article 1, pp. 1–19. DOI: https://doi.org/10.1145/3411764.3445565
  • [13] Y. Lu and H. Li, “Automatic Lip-Reading System Based on Deep Convolutional Neural Network and Attention-Based Long Short-Term Memory,” Applied Sciences, vol. 9, no. 8, Article 1599, 2019. DOI: https://doi.org/10.3390/app9081599
  • [14] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv:1803.01271, 2018.
  • [15] B. Martinez, P. Ma, S. Petridis, and M. Pantic, “Lipreading using temporal convolutional networks,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6319–6323. DOI: https://doi.org/10.1109/icassp40776.2020.9053841
  • [16] A. Garg, J. Noyola, and S. Bagadia, “Lip reading using CNN and LSTM,” Stanford University, CS231n project report, 2016.
  • [17] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, “LipNet: End-to-End Sentence-Level Lipreading,” Dec. 2016. http://arxiv.org/abs/1611.01599
  • [18] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in ICML, 2006, pp. 369–376.
  • [19] M. Cooke, J. Barker, S. Cunningham, and X. Shao, “An audio-visual corpus for speech perception and automatic speech recognition,” The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421–2424, 2006. DOI: https://doi.org/10.1121/1.22290.
  • [20] https://doi.org/10.17632/4t8vs4dr4v.1.
  • [21] OpenCV Team, OpenCV: Open Source Computer Vision Library, version 4.6.0, 2024. https://opencv.org/
  • [22] D. E. King, dlib: A C++ Library for Machine Learning, version 19.24, 2009. http://dlib.net/
  • [23] A. Jittakoti and S. Phumeechanya, “Temporal Keyframe Technique based on CNN and LSTM for Enhancing Lip Reading Performance,” in 2024 12th International Electrical Engineering Congress (iEECON), IEEE, Mar. 2024, pp. 1–5.
  • [24] R. Shashidhar, M. P. Shashank, and B. Sahana, “Enhancing visual speech recognition for deaf individuals: a hybrid LSTM and CNN 3D model for improved accuracy,” Arabian Journal for Science and Engineering, pp. 1–17, 2023.
  • [25] H. Pourmousa and Ü. Özen, “Lip Reading Using CNN for Turkish Numbers,” Journal of Business in The Digital Age, vol. 5, no. 2, pp. 155–160, 2022.
There are 25 citations in total.

Details

Primary Language Turkish
Subjects Speech Recognition
Journal Section Articles
Authors

Ali Berkol 0000-0002-3056-1226

Nergis Pervan Akman 0000-0003-3241-6812

Talya Tümer Sivri 0000-0003-1813-5539

Hamit Erdem 0000-0003-1704-1581

Early Pub Date December 9, 2024
Publication Date
Submission Date July 31, 2024
Acceptance Date September 13, 2024
Published in Issue Year 2024 Volume: 8 Issue: 2

Cite

IEEE A. Berkol, N. Pervan Akman, T. Tümer Sivri, and H. Erdem, “Türkçe Günlük Kelime ve İfadeler Kullanarak CNN ve LSTM ile Görsel Konuşma Tanıma”, IJMSIT, vol. 8, no. 2, pp. 69–75, 2024.