Research Article

Performance Comparison of Pre-trained Models for Speech-to-Text in Turkish: Whisper-Small and Wav2Vec2-XLS-R-300M

Year 2023, Volume: 16 Issue: 2, 109 - 116, 20.11.2023
https://doi.org/10.54525/tbbmd.1252487

Abstract

In this study, the Turkish speech-to-text performance of Whisper-Small and Wav2Vec2-XLS-R-300M, two models proposed for speech-to-text conversion and pre-trained on a large number of languages, was examined. The study used the Turkish portion of Mozilla Common Voice version 11.0, an open-source dataset. The multilingual Whisper-Small and Wav2Vec2-XLS-R-300M models were fine-tuned on this dataset, which contains a small amount of data. The speech-to-text performance of the two models was then evaluated: the Wav2Vec2-XLS-R-300M model achieved a WER of 0.28 and the Whisper-Small model a WER of 0.16. In addition, the models' performance was examined on test data prepared from call center recordings that were not included in the training and validation sets.
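The comparison above is reported in word error rate (WER). As a rough illustration of how WER is computed (a minimal sketch, not the authors' evaluation code, which likely relies on a library implementation), here is a word-level Levenshtein distance in Python:

```python
# Illustration only: WER = (substitutions + deletions + insertions) / reference words,
# computed via a standard word-level edit-distance dynamic program.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word out of three reference words gives WER = 1/3.
print(wer("merhaba nasılsın bugün", "merhaba nasilsin bugün"))
```

A WER of 0.16 therefore means roughly one word error per six reference words, which is the sense in which Whisper-Small outperforms Wav2Vec2-XLS-R-300M (0.28) here.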

Supporting Institution

TÜBİTAK TEYDEB 1501

Project Number

3210713

Thanks

This work was carried out within project no. 3210713, "Development of a Speech-to-Text System for the Turkish Language Using Current Deep Learning Architectures, Operating under a Software-as-a-Service (SaaS) Model", supported under the TÜBİTAK TEYDEB 1501 program.

References

  • Özlan, B., Haznedaroğlu, A., Arslan, L. M., Automatic fraud detection in call center conversations, In 2019 27th Signal Processing and Communications Applications Conference (SIU), 2019, pp. 1-4.
  • Dhanjal, A. S., Singh, W., An automatic machine translation system for multi-lingual speech to Indian sign language, Multimedia Tools and Applications, 2022, pp. 1-39.
  • Ballati, F., Corno, F., De Russis, L., Assessing virtual assistant capabilities with Italian dysarthric speech, In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility, 2018, pp. 93-101.
  • Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., Kingsbury, B., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Processing Magazine, 2012, 29(6), pp. 82-97.
  • Sainath, T. N., Vinyals, O., Senior, A., Sak, H. Convolutional, long short-term memory, fully connected deep neural networks, IEEE international conference on acoustics, speech and signal processing (ICASSP), 2015, pp. 4580-4584.
  • Alharbi, S., Alrazgan, M., Alrashed, A., Alnomasi, T., Almojel, R., Alharbi, R., Almojil, M., Automatic speech recognition: Systematic literature review, IEEE Access, 9, 2021, pp. 131858-131876.
  • Hellman, E., Nordstrand, M., Research in methods for achieving secure voice anonymization: Evaluation and improvement of voice anonymization techniques for whistleblowing, 2022.
  • Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y., Attention-based models for speech recognition, Advances in neural information processing systems, 2015.
  • Bahar, P., Bieschke, T., Ney, H., A comparative study on end-to-end speech to text translation, Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 792-799.
  • Tang, Y., Pino, J., Wang, C., Ma, X., Genzel, D., A general multi-task learning framework to leverage text data for speech to text tasks, In ICASSP IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 6209-6213.
  • Tombaloğlu, B., Erdem, H. A., SVM based speech to text converter for Turkish language, In 2017 25th Signal Processing and Communications Applications Conference (SIU), 2017, pp. 1-4.
  • Kimanuka, U. A., & Buyuk, O., Turkish speech recognition based on deep neural networks, Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi, 2018, pp.319-329.
  • Tombaloğlu, B., Erdem, H., Deep Learning Based Automatic Speech Recognition for Turkish, Sakarya University Journal of Science, 2020, pp.725-739.
  • Tombaloğlu, B., & Erdem, H., Turkish Speech Recognition Techniques and Applications of Recurrent Units (LSTM and GRU), Gazi University Journal of Science, 2021, pp.1035-1049.
  • Safaya, A., Erzin, E., HuBERT-TR: Reviving Turkish Automatic Speech Recognition with Self-supervised Speech Representation Learning, 2022, arXiv preprint arXiv:2210.07323.
  • Li, Z., Niehues, J., Efficient Speech Translation with Pre-trained Models, 2022, arXiv preprint arXiv:2211.04939.
  • Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. Wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems, 2020, pp. 12449-12460.
  • Vásquez-Correa, J. C., Álvarez Muniain, A., Novel Speech Recognition Systems Applied to Forensics within Child Exploitation: Wav2vec2.0 vs. Whisper, 2023, Sensors, 23(4), 1843.
  • Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I., Robust speech recognition via large-scale weak supervision, 2022, arXiv preprint arXiv:2212.04356.
  • Taşar D.E., An automatic speech recognition system proposal for organizational development, 782089, Master's thesis, Dokuz Eylul University Management Information Systems, 2023.
  • Mercan, Ö. B., Özdil, U., Ozan, Ş., Çok Dilli Sesten Metne Çeviri Modelinin İnce Ayar Yapılarak Türkçe Dilindeki Başarısının Arttırılması (Increasing Performance in Turkish by Finetuning of Multilingual Speech-to-Text Model), 30th Signal Processing and Communications Applications Conference (SIU), 2022, pp. 1-4.
  • Arduengo, J., Köhn, A., The Mozilla Common Voice Corpus. In Proceedings of the 19th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), 2019, pp. 1823-1827.
  • Conneau, A., Baevski, A., Collobert, R., Mohamed, A., & Auli, M., Unsupervised cross-lingual representation learning for speech recognition, 2020, arXiv preprint arXiv:2006.13979.
  • Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu, Q., Goyal, N., Auli, M., XLS-R: Self-supervised cross-lingual speech representation learning at scale, 2021, arXiv preprint arXiv:2111.09296.
  • OpenAI, Whisper/model-card.md at main, openai/whisper, GitHub, December 9, 2022. Retrieved February 5, 2023, from https://github.com/openai/whisper/blob/main/model-card.md
  • Ali, A., Renals, S., Word error rate estimation for speech recognition: e-WER, In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018, pp. 20-24.
  • Maas, A., Xie, Z., Jurafsky, D., & Ng, A. Y., Lexicon-free conversational speech recognition with neural networks, In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 345-354.
  • "wav2vec2-xls-r-300m-tr", https://huggingface.co/Sercan/wav2vec2-xls-r-300m-tr
  • "whisper-small-tr-2", https://huggingface.co/Sercan/whisper-small-tr-2



Details

Primary Language Turkish
Subjects Engineering
Journal Section Articles (Research)
Authors

Öykü Berfin Mercan 0000-0001-7727-0197

Sercan Çepni 0000-0002-3405-6059

Davut Emre Taşar 0000-0002-7788-0478

Şükrü Ozan 0000-0002-3227-348X

Project Number 3210713
Early Pub Date October 22, 2023
Publication Date November 20, 2023
Published in Issue Year 2023 Volume: 16 Issue: 2

Cite

APA Mercan, Ö. B., Çepni, S., Taşar, D. E., Ozan, Ş. (2023). Türkçe Konuşmadan Metne Dönüştürme için Ön Eğitimli Modellerin Performans Karşılaştırması: Whisper-Small ve Wav2Vec2-XLS-R-300M. Türkiye Bilişim Vakfı Bilgisayar Bilimleri Ve Mühendisliği Dergisi, 16(2), 109-116. https://doi.org/10.54525/tbbmd.1252487
AMA Mercan ÖB, Çepni S, Taşar DE, Ozan Ş. Türkçe Konuşmadan Metne Dönüştürme için Ön Eğitimli Modellerin Performans Karşılaştırması: Whisper-Small ve Wav2Vec2-XLS-R-300M. TBV-BBMD. November 2023;16(2):109-116. doi:10.54525/tbbmd.1252487
Chicago Mercan, Öykü Berfin, Sercan Çepni, Davut Emre Taşar, and Şükrü Ozan. “Türkçe Konuşmadan Metne Dönüştürme için Ön Eğitimli Modellerin Performans Karşılaştırması: Whisper-Small Ve Wav2Vec2-XLS-R-300M”. Türkiye Bilişim Vakfı Bilgisayar Bilimleri Ve Mühendisliği Dergisi 16, no. 2 (November 2023): 109-16. https://doi.org/10.54525/tbbmd.1252487.
EndNote Mercan ÖB, Çepni S, Taşar DE, Ozan Ş (November 1, 2023) Türkçe Konuşmadan Metne Dönüştürme için Ön Eğitimli Modellerin Performans Karşılaştırması: Whisper-Small ve Wav2Vec2-XLS-R-300M. Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi 16 2 109–116.
IEEE Ö. B. Mercan, S. Çepni, D. E. Taşar, and Ş. Ozan, “Türkçe Konuşmadan Metne Dönüştürme için Ön Eğitimli Modellerin Performans Karşılaştırması: Whisper-Small ve Wav2Vec2-XLS-R-300M”, TBV-BBMD, vol. 16, no. 2, pp. 109–116, 2023, doi: 10.54525/tbbmd.1252487.
ISNAD Mercan, Öykü Berfin et al. “Türkçe Konuşmadan Metne Dönüştürme için Ön Eğitimli Modellerin Performans Karşılaştırması: Whisper-Small Ve Wav2Vec2-XLS-R-300M”. Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi 16/2 (November 2023), 109-116. https://doi.org/10.54525/tbbmd.1252487.
JAMA Mercan ÖB, Çepni S, Taşar DE, Ozan Ş. Türkçe Konuşmadan Metne Dönüştürme için Ön Eğitimli Modellerin Performans Karşılaştırması: Whisper-Small ve Wav2Vec2-XLS-R-300M. TBV-BBMD. 2023;16:109–116.
MLA Mercan, Öykü Berfin et al. “Türkçe Konuşmadan Metne Dönüştürme için Ön Eğitimli Modellerin Performans Karşılaştırması: Whisper-Small Ve Wav2Vec2-XLS-R-300M”. Türkiye Bilişim Vakfı Bilgisayar Bilimleri Ve Mühendisliği Dergisi, vol. 16, no. 2, 2023, pp. 109-16, doi:10.54525/tbbmd.1252487.
Vancouver Mercan ÖB, Çepni S, Taşar DE, Ozan Ş. Türkçe Konuşmadan Metne Dönüştürme için Ön Eğitimli Modellerin Performans Karşılaştırması: Whisper-Small ve Wav2Vec2-XLS-R-300M. TBV-BBMD. 2023;16(2):109-16.

Article Acceptance

Articles are submitted online after user registration and login.

The acceptance process of the articles sent to the journal consists of the following stages:

1. Each submitted article is first sent to at least two referees.

2. Referee assignments are made by the journal editors. The journal's referee pool contains approximately 200 referees, classified by their areas of interest, and each referee is sent articles within their own area. Referees are selected so as to avoid any conflict of interest.

3. Articles are sent to the referees with the authors' names removed.

4. Referees are given instructions on how to evaluate an article and are asked to fill in the evaluation form.

5. Articles that receive positive opinions from both referees undergo a similarity check by the editors; the similarity score is expected to be below 25%.

6. A paper that has passed all stages is reviewed by the editor for language and presentation, and necessary corrections and improvements are made. The authors are notified if needed.


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.