Effects of Feature Extraction Techniques on Classification of Turkish Texts

Özge Akdoğan; Selma Ayşe Özel

doi:10.21605/cukurovaummfd.637643

Research Article

Year 2019, Volume: 34, Issue: 3, 95-108, 30.09.2019

Cited By: 2

https://izlik.org/JA32LS29SK

Abstract

Feature extraction is the most important preprocessing step of the text classification task. The effects of preprocessing techniques on text mining for English have been studied extensively. However, studies for Turkish are limited and are generally confined to a specific problem domain. In this study, we investigate the effects of feature extraction techniques on four different Turkish text classification problems, namely news classification, spam e-mail detection, sentiment analysis, and author detection, to show the differences and similarities among the problems. We also propose a new feature selection method to reduce the feature space. The experimental analysis shows that stopword removal improves classification performance, whereas stemming has no positive effect on classification accuracy. The most successful term weighting methods are tf and tf*idf. The proposed feature selection method improves classification performance and achieves higher accuracy than well-known methods.

Keywords

Text classification, Preprocessing methods, Feature extraction, Turkish texts
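
The abstract names stopword removal and the tf and tf*idf term weighting schemes without defining them. The following is a minimal Python sketch of those three steps, not the authors' implementation. The toy corpus, the stopword list, the function names, and the log(N / df) form of idf are all assumptions made for illustration; the study's own tooling and its proposed feature selection method are not shown here.

```python
import math
from collections import Counter

# Hypothetical toy corpus and stopword list, for illustration only.
DOCS = [
    "bu film çok güzel ve çok etkileyici",
    "bu film çok kötü",
    "maç sonucu ve haber özeti",
]
STOPWORDS = {"bu", "ve", "çok"}


def tokenize(text, stopwords=frozenset()):
    """Split on whitespace and drop stopwords.

    The toy corpus is already lowercase; Python's str.lower() is not
    locale-aware, so Turkish I/İ casing would need separate handling.
    """
    return [tok for tok in text.split() if tok not in stopwords]


def tf_weights(tokens):
    """tf: raw count of each term within one document."""
    return dict(Counter(tokens))


def idf_weights(tokenized_docs):
    """idf: log(N / df), one common variant (assumed, not from the paper)."""
    n_docs = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_weights(tokens, idf):
    """tf*idf: term count scaled by how rare the term is across documents."""
    return {term: freq * idf[term] for term, freq in tf_weights(tokens).items()}


tokenized = [tokenize(doc, STOPWORDS) for doc in DOCS]
idf = idf_weights(tokenized)
for tokens in tokenized:
    print(tf_weights(tokens), tfidf_weights(tokens, idf))
```

In this sketch a term such as "film", which occurs in two of the three documents, receives a lower idf than a term that occurs in only one, so tf*idf downweights it relative to plain tf.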

Turkish title: Nitelik Çıkarımı Yöntemlerinin Türkçe Metinlerin Sınıflandırılmasına Etkisi

The article lists 22 references in total.

Details

Primary Language	English
Section	Research Article
Authors	Özge Akdoğan, Selma Ayşe Özel
Publication Date	September 30, 2019
DOI	https://doi.org/10.21605/cukurovaummfd.637643
IZ	https://izlik.org/JA32LS29SK
Published in Issue	Year 2019, Volume: 34, Issue: 3

Cite

APA	Akdoğan, Ö., & Özel, S. A. (2019). Effects of Feature Extraction Techniques on Classification of Turkish Texts. Çukurova Üniversitesi Mühendislik-Mimarlık Fakültesi Dergisi, 34(3), 95-108. https://doi.org/10.21605/cukurovaummfd.637643
AMA	1. Akdoğan Ö, Özel SA. Effects of Feature Extraction Techniques on Classification of Turkish Texts. cukurovaummfd. 2019;34(3):95-108. doi:10.21605/cukurovaummfd.637643
Chicago	Akdoğan, Özge, and Selma Ayşe Özel. 2019. “Effects of Feature Extraction Techniques on Classification of Turkish Texts”. Çukurova Üniversitesi Mühendislik-Mimarlık Fakültesi Dergisi 34 (3): 95-108. https://doi.org/10.21605/cukurovaummfd.637643.
EndNote	Akdoğan Ö, Özel SA (01 September 2019) Effects of Feature Extraction Techniques on Classification of Turkish Texts. Çukurova Üniversitesi Mühendislik-Mimarlık Fakültesi Dergisi 34 3 95–108.
IEEE	[1] Ö. Akdoğan and S. A. Özel, “Effects of Feature Extraction Techniques on Classification of Turkish Texts”, cukurovaummfd, vol. 34, no. 3, pp. 95–108, Sep. 2019, doi: 10.21605/cukurovaummfd.637643.
ISNAD	Akdoğan, Özge - Özel, Selma Ayşe. “Effects of Feature Extraction Techniques on Classification of Turkish Texts”. Çukurova Üniversitesi Mühendislik-Mimarlık Fakültesi Dergisi 34/3 (01 September 2019): 95-108. https://doi.org/10.21605/cukurovaummfd.637643.
JAMA	1. Akdoğan Ö, Özel SA. Effects of Feature Extraction Techniques on Classification of Turkish Texts. cukurovaummfd. 2019;34:95–108.
MLA	Akdoğan, Özge, and Selma Ayşe Özel. “Effects of Feature Extraction Techniques on Classification of Turkish Texts”. Çukurova Üniversitesi Mühendislik-Mimarlık Fakültesi Dergisi, vol. 34, no. 3, September 2019, pp. 95-108, doi:10.21605/cukurovaummfd.637643.
Vancouver	1. Özge Akdoğan, Selma Ayşe Özel. Effects of Feature Extraction Techniques on Classification of Turkish Texts. cukurovaummfd. 01 September 2019;34(3):95-108. doi:10.21605/cukurovaummfd.637643

Cited By

Advancing natural language processing (NLP) applications of morphologically rich languages with bidirectional encoder representations from transformers (BERT): an empirical case study for Turkish

Automatika

Akın Özçift

https://doi.org/10.1080/00051144.2021.1922150

Analysis of whether news on the Internet is real or fake by using deep learning methods and the TF-IDF algorithm

International Advanced Researches and Engineering Journal

Tilbe KORKMAZ

https://doi.org/10.35860/iarej.779019