Research Article

Spam Filtering on Turkish YouTube Comments

Year 2022, Volume: 10 Issue: 4, 1793 - 1810, 25.10.2022

Abstract

YouTube is one of the platforms most preferred by social media users, and its growing use has brought problems with it. Unwanted (spam) comments, which are generally unrelated to the video content, posted for advertising purposes, and constantly repeated, waste resources. This study aims to detect unwanted YouTube comments automatically. Although systems have been developed to solve text classification problems in other languages, studies for Turkish are very limited. In this study, datasets consisting of Turkish YouTube comments were created, and the performance of automatic text classification algorithms on these datasets was evaluated. An important contribution of the study is the creation of 5 Turkish datasets that will be made publicly available for future academic studies. Classification algorithms that give good results in terms of accuracy and speed were compared using the Weka data mining tool. In terms of accuracy, the SMO machine learning algorithm appears to be more successful than the others at classifying Turkish YouTube comments. In addition, the effect of feature selection on classification performance was investigated, and it was observed to generally yield slight improvements in classification accuracy.
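As an illustration of the feature selection step mentioned in the abstract, the following is a minimal sketch in plain Python. The abstract does not state which selection method the study used; chi-square term scoring is assumed here only because it is a common choice for text classification. The toy comments, the whitespace tokenizer, and the function names (`tokenize`, `chi_square_scores`, `select_top_k`) are all hypothetical and are not taken from the study, its datasets, or Weka.

```python
from collections import Counter

# Hypothetical toy comments for illustration only (not from the study's datasets).
COMMENTS = [
    ("subscribe to my channel free gift", "spam"),
    ("free gift click my channel", "spam"),
    ("click here free followers", "spam"),
    ("great video thanks", "ham"),
    ("thanks for the great tutorial", "ham"),
    ("the video helped me thanks", "ham"),
]


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def chi_square_scores(comments, positive="spam"):
    """Score each term with the chi-square statistic of its 2x2 table
    (term present/absent vs. positive/other class), using document counts."""
    n = len(comments)
    n_pos = sum(1 for _, label in comments if label == positive)
    n_neg = n - n_pos

    df_pos, df_neg = Counter(), Counter()
    for text, label in comments:
        terms = set(tokenize(text))  # count each term once per comment
        (df_pos if label == positive else df_neg).update(terms)

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # positive comments containing the term
        b = df_neg[term]   # other comments containing the term
        c = n_pos - a      # positive comments without the term
        d = n_neg - b      # other comments without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    scores = chi_square_scores(COMMENTS)
    print(select_top_k(scores, 2))
```

In a pipeline of the kind the abstract describes, only the selected terms would be kept as features before training a classifier, and accuracy with and without selection could then be compared.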


Türkçe YouTube Yorumları Üzerinde Spam Filtreleme



There are 29 citations in total.

Details

Primary Language Turkish
Subjects Engineering
Journal Section Articles
Authors

Sevinj Shirzadova 0000-0002-5819-9599

Alper Kürşat Uysal 0000-0002-4057-934X

Publication Date October 25, 2022
Published in Issue Year 2022 Volume: 10 Issue: 4

Cite

APA Shirzadova, S., & Uysal, A. K. (2022). Türkçe YouTube Yorumları Üzerinde Spam Filtreleme. Duzce University Journal of Science and Technology, 10(4), 1793-1810. https://doi.org/10.29130/dubited.974309
AMA Shirzadova S, Uysal AK. Türkçe YouTube Yorumları Üzerinde Spam Filtreleme. DUBİTED. October 2022;10(4):1793-1810. doi:10.29130/dubited.974309
Chicago Shirzadova, Sevinj, and Alper Kürşat Uysal. “Türkçe YouTube Yorumları Üzerinde Spam Filtreleme”. Duzce University Journal of Science and Technology 10, no. 4 (October 2022): 1793-1810. https://doi.org/10.29130/dubited.974309.
EndNote Shirzadova S, Uysal AK (October 1, 2022) Türkçe YouTube Yorumları Üzerinde Spam Filtreleme. Duzce University Journal of Science and Technology 10 4 1793–1810.
IEEE S. Shirzadova and A. K. Uysal, “Türkçe YouTube Yorumları Üzerinde Spam Filtreleme”, DUBİTED, vol. 10, no. 4, pp. 1793–1810, 2022, doi: 10.29130/dubited.974309.
ISNAD Shirzadova, Sevinj - Uysal, Alper Kürşat. “Türkçe YouTube Yorumları Üzerinde Spam Filtreleme”. Duzce University Journal of Science and Technology 10/4 (October 2022), 1793-1810. https://doi.org/10.29130/dubited.974309.
JAMA Shirzadova S, Uysal AK. Türkçe YouTube Yorumları Üzerinde Spam Filtreleme. DUBİTED. 2022;10:1793–1810.
MLA Shirzadova, Sevinj and Alper Kürşat Uysal. “Türkçe YouTube Yorumları Üzerinde Spam Filtreleme”. Duzce University Journal of Science and Technology, vol. 10, no. 4, 2022, pp. 1793-1810, doi:10.29130/dubited.974309.
Vancouver Shirzadova S, Uysal AK. Türkçe YouTube Yorumları Üzerinde Spam Filtreleme. DUBİTED. 2022;10(4):1793-810.