Research Article

MULTICLASS CLASSIFICATION OF SCIENTIFIC TEXTS WRITTEN IN TURKISH BY APPLYING DEEP LEARNING TECHNIQUE

Year 2022, Volume 10, Issue 2, 504 - 519, 30.06.2022
https://doi.org/10.21923/jesd.973181

Abstract

The BERT deep learning technique, which was developed by Google in October 2018, has become very popular in the machine learning and natural language processing communities. BERT, which stands for Bidirectional Encoder Representations from Transformers, can be described as a natural language processing technique that combines artificial intelligence and machine learning technologies. Classification problems, which belong to the supervised learning methodology, are frequently encountered today. Classification rests on a trained model's ability to predict the class of new, unseen data; the goal is to assign each data point to one of the classes defined over a dataset. In Turkish, many difficulties arise from its being an agglutinative language with a rich but complex morphology, which makes multiclass classification problems hard to solve. With the BERT deep learning technique, however, these problems have become easier to solve. As our dataset, we used academic research and scientific studies written in Turkish over the last 10 years. Applying the BERT deep learning technique, we fine-tuned a pre-trained Turkish BERT model on this dataset for use in multiclass classification problems. Our experiments show that the trained system achieves 96% accuracy.
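The fine-tuning workflow the abstract describes — taking a pre-trained Turkish BERT model (such as the BERTurk checkpoint cited below) and adding a multiclass classification head — can be sketched as follows. This is a minimal illustration, not the authors' actual code: the model id `dbmdz/bert-base-turkish-cased`, the hyperparameters, and the example category labels are assumptions, and the Hugging Face `transformers` Trainer API is one common way to do such fine-tuning.

```python
def build_label_maps(labels):
    """Map class names to integer ids and back, as a classification head requires."""
    label2id = {label: i for i, label in enumerate(sorted(set(labels)))}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def accuracy(y_true, y_pred):
    """Fraction of correct predictions; the metric the abstract reports as 96%."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def fine_tune(texts, labels, model_name="dbmdz/bert-base-turkish-cased"):
    """Fine-tune a pre-trained Turkish BERT for multiclass text classification.

    Hyperparameters here (3 epochs, batch size 16, learning rate 2e-5) are the
    usual BERT fine-tuning defaults, not values taken from the paper.
    """
    import torch
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    label2id, id2label = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(label2id), label2id=label2id, id2label=id2label)

    enc = tokenizer(texts, truncation=True, padding=True, max_length=512)

    class TextDataset(torch.utils.data.Dataset):
        """Wrap tokenizer output and integer labels for the Trainer."""
        def __len__(self):
            return len(labels)

        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in enc.items()}
            item["labels"] = torch.tensor(label2id[labels[i]])
            return item

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="berturk-out", num_train_epochs=3,
                               per_device_train_batch_size=16, learning_rate=2e-5),
        train_dataset=TextDataset())
    trainer.train()
    return trainer
```

Calling `fine_tune(texts, labels)` with, say, paper abstracts and hypothetical subject labels such as "Bilgisayar Bilimleri" would train the classification head on top of the frozen-architecture BERT encoder; the helper functions above are plain Python and usable on their own.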

References

  • Acikalin, U. U., Bardak, B., & Kutlu, M. (2020). Turkish Sentiment Analysis Using BERT. In 2020 28th Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). IEEE.
  • Akin, S. E., & Yildiz, T. (2019, July). Sentiment Analysis through Transfer Learning for Turkish Language. In 2019 IEEE International Symposium on INnovations in Intelligent SysTems and Applications (INISTA) (pp. 1-6). IEEE.
  • BERTurk. (2020). https://github.com/stefan-it/turkish-bert (Accessed: 30.01.2021)
  • Bisong, E. (2019). Google colaboratory. In Building Machine Learning and Deep Learning Models on Google Cloud Platform (pp. 59-64). Apress, Berkeley, CA.
  • Chandra, R. V., & Varanasi, B. S. (2015). Python requests essentials. Packt Publishing Ltd.
  • Çoban, Ö., İnan, A., & Özel, S. A. (2021). Facebook Tells Me Your Gender: An Exploratory Study of Gender Prediction for Turkish Facebook Users. Transactions on Asian and Low-Resource Language Information Processing, 20(4), 1-38.
  • Deng, L., & Yu, D. (2014). Deep learning: methods and applications. Foundations and trends in signal processing, 7(3–4), 197-387.
  • Denny, M. J., & Spirling, A. (2018). Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it. Political Analysis, 26(2), 168-189.
  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Google Research BERT. (2018). https://github.com/google-research/bert (Accessed: 07.02.2021)
  • Jia, Z., Maggioni, M., Smith, J., & Scarpazza, D. P. (2019). Dissecting the NVidia Turing T4 GPU via microbenchmarking. arXiv preprint arXiv:1903.07486.
  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kraipeerapun, P. (2009). Neural network classification based on quantification of uncertainty (Doctoral dissertation, Murdoch University).
  • Lee, J. J. (2013). Mechanize: Stateful programmatic web browsing in Python. http://wwwsearch.sourceforge.net/mechanize/ (Accessed: 17.01.2021)
  • Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234-1240.
  • Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Madabushi, H. T., Kochkina, E., & Castelle, M. (2020). Cost-sensitive BERT for generalisable sentence classification with imbalanced data. arXiv preprint arXiv:2003.11563.
  • Opitz, J., & Burst, S. (2019). Macro F1 and macro F1. arXiv preprint arXiv:1911.03347.
  • Özçift, A., Akarsu, K., Yumuk, F., & Söylemez, C. (2021). Advancing natural language processing (NLP) applications of morphologically rich languages with bidirectional encoder representations from transformers (BERT): an empirical case study for Turkish. Automatika, 1-13.
  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
  • Richardson, L. (2007). Beautiful Soup documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (Accessed: 15.01.2021)
  • Schachinger, K. (2017). A Complete Guide to the Google RankBrain Algorithm. Search Engine Journal.
  • Sevli, O., & Kemaloğlu, N. (2021). Olağandışı Olaylar Hakkındaki Tweet'lerin Gerçek ve Gerçek Dışı Olarak Google BERT Modeli ile Sınıflandırılması. Veri Bilimi, 4(1), 31-37.
  • Song, K., Tan, X., Qin, T., Lu, J., & Liu, T. Y. (2020). Mpnet: Masked and permuted pre-training for language understanding. arXiv preprint arXiv:2004.09297.
  • Şahin, G., & Diri, B. (2021, June). The Effect of Transfer Learning on Turkish Text Classification. In 2021 29th Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). IEEE.
  • Tantuğ, A. C. (2016). Metin Sınıflandırma. Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi, 5(2).
  • Tuzcu, S. (2020). Çevrimiçi Kullanıcı Yorumlarının Duygu Analizi ile Sınıflandırılması. Eskişehir Türk Dünyası Uygulama ve Araştırma Merkezi Bilişim Dergisi, 1(2), 1-5.
  • Uçan, A., Dörterler, M., & Akçapınar Sezer, E. (2021). A study of Turkish emotion classification with pretrained language models. Journal of Information Science, 0165551520985507.
  • What's New In Python 3.7. (2018). https://docs.python.org/3.7/whatsnew/3.7.html (Accessed: 18.04.2021)


There are 29 citations in total.

Details

Primary Language Turkish
Subjects Computer Software
Journal Section Research Articles
Authors

Mustafa Özkan 0000-0003-4287-9220

Görkem Kar 0000-0003-0367-4409

Publication Date June 30, 2022
Submission Date July 19, 2021
Acceptance Date December 28, 2021
Published in Issue Year 2022

Cite

APA Özkan, M., & Kar, G. (2022). TÜRKÇE DİLİNDE YAZILAN BİLİMSEL METİNLERİN DERİN ÖĞRENME TEKNİĞİ UYGULANARAK ÇOKLU SINIFLANDIRILMASI. Mühendislik Bilimleri Ve Tasarım Dergisi, 10(2), 504-519. https://doi.org/10.21923/jesd.973181