Feature Selection for Comment Spam Filtering on YouTube

Alper Kürşat Uysal

Research Article

Year 2018, Volume: 1 Issue: 1, 4 - 8, 27.12.2018

Abstract

Spam filtering is one of the most popular domains for text classification. While there are many studies on the classification of spam e-mails and short text messages, comment spam filtering on YouTube is a relatively new topic, as only a limited number of annotated datasets exist. As in all text classification problems, the high dimensionality of the feature space is one of the biggest problems for spam filtering because of its effect on accuracy. The contribution of this study is an analysis of the performance of five state-of-the-art text feature selection methods for spam filtering on YouTube using two widely known classifiers, namely naïve Bayes (NB) and decision tree (DT). Five datasets containing spam comments on different subjects were used in the experiments: Psy, KatyPerry, LMFAO, Eminem, and Shakira. The Macro-F1 measure was used for evaluation, with 3-fold cross-validation for a fair performance assessment. The experiments indicated that the distinguishing feature selector (DFS) and Gini Index (GI) methods are superior to the other three feature selection methods for spam filtering on YouTube. In addition, the DT classifier performs better than the NB classifier in most cases.
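
The abstract names DFS and GI as the best-performing selectors but does not give their formulas. As a rough illustration only, the sketch below scores terms using the definitions commonly cited for these two methods: DFS(t) = Σ_i P(C_i|t) / (P(t̄|C_i) + P(t|C̄_i) + 1) and GI(t) = Σ_i P(t|C_i)² · P(C_i|t)². These formulas, the function name `score_terms`, the whitespace tokenizer, and the four-comment toy corpus are assumptions made for this example; they are not taken from the paper's experimental setup.

```python
from collections import Counter, defaultdict

def score_terms(docs, labels):
    """Return {term: (dfs, gini)} computed from document-level term presence.

    Hypothetical helper: the formulas follow the commonly cited DFS and
    Gini Index definitions, not code from the paper.
    """
    classes = sorted(set(labels))
    n_class = Counter(labels)          # number of documents per class
    n_total = len(docs)
    df = defaultdict(Counter)          # df[term][cls] = docs of cls containing term
    for text, cls in zip(docs, labels):
        for term in set(text.lower().split()):   # naive whitespace tokenizer
            df[term][cls] += 1

    scores = {}
    for term, per_class in df.items():
        n_term = sum(per_class.values())         # docs containing the term
        dfs = gini = 0.0
        for cls in classes:
            in_cls = per_class[cls]
            p_c_given_t = in_cls / n_term
            p_t_given_c = in_cls / n_class[cls]
            p_not_t_given_c = 1.0 - p_t_given_c
            n_other = n_total - n_class[cls]
            p_t_given_not_c = (n_term - in_cls) / n_other if n_other else 0.0
            dfs += p_c_given_t / (p_not_t_given_c + p_t_given_not_c + 1.0)
            gini += (p_t_given_c ** 2) * (p_c_given_t ** 2)
        scores[term] = (dfs, gini)
    return scores

# Toy corpus (invented for illustration, not drawn from the five datasets).
docs = [
    "check out my channel",
    "subscribe to my channel please",
    "love this song",
    "please play this song",
]
labels = ["spam", "spam", "ham", "ham"]

scores = score_terms(docs, labels)
for term in ("channel", "check", "please"):
    dfs, gini = scores[term]
    print(f"{term}: DFS={dfs:.3f} GI={gini:.3f}")
```

In this toy corpus, "channel" appears in every spam comment and in no other comment, so it gets the highest score under both measures. "please" occurs equally in both classes and gets the lowest DFS value possible for two classes. In a filtering pipeline, the top-ranked terms would be kept as features before training a classifier such as NB or DT.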

Keywords

Feature selection, pattern recognition, spam filtering, YouTube

Details

Primary Language	English
Subjects	Graph, Social and Multimedia Data, Data Mining and Knowledge Discovery
Section	Research Article
Authors	Alper Kürşat Uysal
Publication Date	December 27, 2018
Published in Issue	Year 2018 Volume: 1 Issue: 1

Cite

IEEE	A. K. Uysal, “Feature Selection for Comment Spam Filtering on YouTube”, DataSCI, vol. 1, no. 1, pp. 4–8, 2018.
