Performance Comparison of Pre-trained Models for Speech-to-Text in Turkish: Whisper-Small and Wav2Vec2-XLS-R-300M
Year 2023, Volume: 16 Issue: 2, 109 - 116, 20.11.2023
Öykü Berfin Mercan, Sercan Çepni, Davut Emre Taşar, Şükrü Ozan
Abstract
In this study, the Turkish speech-to-text performance of Whisper-Small and Wav2Vec2-XLS-R-300M, two multilingual models pre-trained for speech-to-text, was examined. Mozilla Common Voice version 11.0, an open-source dataset prepared in the Turkish language, was used. Both multilingual models were fine-tuned on this dataset, which contains a small amount of data, and their speech-to-text performance was compared: the Wav2Vec2-XLS-R-300M model achieved a WER of 0.28, while the Whisper-Small model achieved a WER of 0.16. In addition, the models were evaluated on test data prepared from call center recordings that were not included in the training and validation sets.
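Word error rate (WER), the metric used to compare the two models, counts word-level substitutions, deletions, and insertions against the reference transcript, normalized by the reference length. A minimal illustrative sketch (not code from the study, which reports using standard WER evaluation) computes it with word-level edit distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One word dropped out of four reference words -> WER 0.25
print(round(wer("bugün hava çok güzel", "bugün hava güzel"), 2))  # 0.25
```

A WER of 0.16 therefore means roughly one word in six of the reference transcript was transcribed incorrectly; lower is better.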
Supporting Institution
TÜBİTAK TEYDEB 1501
Acknowledgments
This study was carried out within the scope of project no. 3210713, "Güncel Derin Öğrenme Mimarileri ile Türkçe Dili için Konuşmadan Metne Çeviri Yapabilen ve Hizmet Olarak Yazılım (SaaS) Modeli ile Çalışan Sistemin Geliştirilmesi" (Development of a Speech-to-Text System for the Turkish Language Based on Current Deep Learning Architectures, Operating under the Software-as-a-Service (SaaS) Model), supported under the TÜBİTAK TEYDEB 1501 program.
References
- Özlan, B., Haznedaroğlu, A., Arslan, L. M., Automatic fraud detection in call center conversations, In 2019 27th Signal Processing and Communications Applications Conference (SIU), 2019, pp. 1-4.
- Dhanjal, A. S., Singh, W., An automatic machine translation system for multi-lingual speech to Indian sign language, Multimedia Tools and Applications, 2022, pp. 1-39.
- Ballati, F., Corno, F., De Russis, L., Assessing virtual assistant capabilities with Italian dysarthric speech, In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility, 2018, pp. 93-101.
- Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., Kingsbury, B., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Processing Magazine, 2012, 29(6), pp. 82-97.
- Sainath, T. N., Vinyals, O., Senior, A., Sak, H. Convolutional, long short-term memory, fully connected deep neural networks, IEEE international conference on acoustics, speech and signal processing (ICASSP), 2015, pp. 4580-4584.
- Alharbi, S., Alrazgan, M., Alrashed, A., Alnomasi, T., Almojel, R., Alharbi, R., Almojil, M., Automatic speech recognition: Systematic literature review, IEEE Access, 9, 2021, pp. 131858-131876.
- Hellman, E., Nordstrand, M., Research in methods for achieving secure voice anonymization: Evaluation and improvement of voice anonymization techniques for whistleblowing, 2022.
- Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y., Attention-based models for speech recognition, Advances in neural information processing systems, 2015.
- Bahar, P., Bieschke, T., Ney, H., A comparative study on end-to-end speech to text translation, Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 792-799.
- Tang, Y., Pino, J., Wang, C., Ma, X., Genzel, D., A general multi-task learning framework to leverage text data for speech to text tasks, In ICASSP IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 6209-6213.
- Tombaloğlu, B., Erdem, H. A., SVM based speech to text converter for Turkish language, In 2017 25th Signal Processing and Communications Applications Conference (SIU), 2017, pp. 1-4.
- Kimanuka, U. A., & Buyuk, O., Turkish speech recognition based on deep neural networks, Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi, 2018, pp.319-329.
- Tombaloğlu, B., Erdem, H., Deep Learning Based Automatic Speech Recognition for Turkish, Sakarya University Journal of Science, 2020, pp.725-739.
- Tombaloğlu, B., & Erdem, H., Turkish Speech Recognition Techniques and Applications of Recurrent Units (LSTM and GRU), Gazi University Journal of Science, 2021, pp.1035-1049.
- Safaya, A., Erzin, E., HuBERT-TR: Reviving Turkish Automatic Speech Recognition with Self-supervised Speech Representation Learning, 2022, arXiv preprint arXiv:2210.07323.
- Li, Z., Niehues, J., Efficient Speech Translation with Pre-trained Models, 2022, arXiv preprint arXiv:2211.04939.
- Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. Wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems, 2020, pp. 12449-12460.
- Vásquez-Correa, J. C., Álvarez Muniain, A., Novel Speech Recognition Systems Applied to Forensics within Child Exploitation: Wav2vec2.0 vs. Whisper, 2023, Sensors, 23(4), 1843.
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I., Robust speech recognition via large-scale weak supervision, 2022, arXiv preprint arXiv:2212.04356.
- Taşar D.E., An automatic speech recognition system proposal for organizational development, 782089, Master's thesis, Dokuz Eylul University Management Information Systems, 2023.
- Mercan, Ö. B., Özdil, U., Ozan, Ş., Increasing Performance in Turkish by Finetuning of Multilingual Speech-to-Text Model, 30th Signal Processing and Communications Applications Conference (SIU), 2022, pp. 1-4.
- Arduengo, J., Köhn, A., The Mozilla Common Voice Corpus. In Proceedings of the 19th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), 2019, pp. 1823-1827.
- Conneau, A., Baevski, A., Collobert, R., Mohamed, A., & Auli, M., Unsupervised cross-lingual representation learning for speech recognition, 2020, arXiv preprint arXiv:2006.13979.
- Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu, Q., Goyal, N., Auli, M., XLS-R: Self-supervised cross-lingual speech representation learning at scale, 2021, arXiv preprint arXiv:2111.09296.
- OpenAI. (2022, December 9). Whisper/model-card.md at main · openai/whisper. GitHub. Retrieved February 5, 2023, from https://github.com/openai/whisper/blob/main/model-card.md
- Ali, A., Renals, S., Word error rate estimation for speech recognition: e-WER, In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018, pp. 20-24.
- Maas, A., Xie, Z., Jurafsky, D., & Ng, A. Y., Lexicon-free conversational speech recognition with neural networks, In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 345-354.
- “wav2vec2-xls-r-300m-tr”, https://huggingface.co/Sercan/wav2vec2-xls-r-300m-tr
- “whisper-small-tr-2”, https://huggingface.co/Sercan/whisper-small-tr-2