Research Article
YouTube Yorumlarından Spam Tespitine Yönelik Makine Öğrenmesi ve Derin Öğrenme Yöntemlerinin Karşılaştırmalı Bir Analizi

Year 2025, Volume: 6 Issue: 1, 30 - 50, 25.06.2025
https://doi.org/10.5281/zenodo.15719417


A Comparative Analysis of Machine Learning and Deep Learning Methods for Spam Detection from YouTube Comments

Abstract

Spam content threatens information security on social media platforms, and manual detection methods are inadequate, so automatic spam detection systems are essential. Machine learning and deep learning techniques are well suited to this task because they can classify spam comments using contextual relationships and semantics rather than keywords alone. This study presents a comparative analysis of machine learning and deep learning models for automatic spam detection in YouTube comments. Logistic regression (LR), random forest (RF), support vector machine (SVM), XGBoost, and bidirectional long short-term memory (Bi-LSTM) models were evaluated. The comments were converted to numerical form with TF-IDF vectorization to give the models a suitable input representation. In the experiments, Bi-LSTM outperformed the other models with a classification accuracy of 97.1%, a result attributed to its ability to learn long-term dependencies in text.
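
The TF-IDF step named in the abstract can be illustrated with a minimal sketch. The abstract does not state which TF-IDF variant, tokenizer, or parameters the study used, so everything below is an assumption: the helper names `tokenize` and `tfidf_vectors` are hypothetical, the IDF uses a common smoothed form, ln((1 + N) / (1 + df)) + 1, followed by L2 normalization, and the toy comments are invented for illustration and do not come from the study's dataset.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; a real pipeline would clean text further.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vec = [counts[t] * idf[t] for t in vocab]
        # Scale each vector to unit length; guard against an empty document.
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


# Hypothetical toy comments, not taken from the study's dataset.
comments = [
    "check out my channel for free gifts",
    "great video thanks for sharing",
    "free gifts click my channel now",
]
vocab, vectors = tfidf_vectors(comments)
```

Each comment becomes a fixed-length numeric vector in which terms that appear in fewer comments receive a higher weight. Vectors of this kind are the sort of input that classifiers such as LR, RF, SVM, or XGBoost can be trained on.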

There are 40 citations in total.

Details

Primary Language: Turkish
Subjects: Natural Language Processing
Journal Section: Research Article
Authors: Anıl Utku (ORCID 0000-0002-7240-8713)
Submission Date: March 6, 2025
Acceptance Date: May 29, 2025
Publication Date: June 25, 2025
Published in Issue: Year 2025, Volume 6, Issue 1

Cite

Vancouver: Utku A. YouTube Yorumlarından Spam Tespitine Yönelik Makine Öğrenmesi ve Derin Öğrenme Yöntemlerinin Karşılaştırmalı Bir Analizi. BUTS. 2025;6(1):30-50.
This journal is prepared and published by the Bingöl University Technical Sciences journal team.