The Effect of Various Text Representation Methods for Sentiment Analysis on Movie Review Data with Different Machine Learning Methods

Veysel Göç; Muhammet Sinan Başarslan

doi:10.29109/gujsc.1498509

Research Article

Year 2024, Early View, 1 - 1

Abstract

In this study, we examine how machine learning (ML) models perform when combined with different text representation methods on the balanced IMDB dataset, which is widely regarded as a gold standard in sentiment analysis, a natural language processing (NLP) task. On the open-source IMDB movie reviews dataset, we first clean and preprocess the data and then convert the text into feature representations. We then perform sentiment classification with different ML models. To evaluate the models, we use accuracy (ACC), precision (P), recall (R), F1-score (F1), and area under the curve (AUC), together with the receiver operating characteristic (ROC) curve. Text feature extraction with Bidirectional Encoder Representations from Transformers (BERT) gave the highest performance for all models, and the support vector machine (SVM) model gave particularly promising results. With this model, we obtained ACC 0.9033, F1 0.9308, R 0.9015, P 0.9072, AUC 0.9638, and ROC 0.96. These findings suggest that ML models built on BERT representations may offer high accuracy and reliability in text classification problems. Future studies should validate these findings by applying BERT to other NLP tasks, which would help assess the effectiveness and applicability of the models in practice.
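As a rough illustration of the evaluation metrics named in the abstract, the following sketch computes ACC, P, R, F1, and AUC for a binary sentiment task in plain Python. The labels, scores, and the 0.5 decision threshold are hypothetical toy values chosen for this example; they are not taken from the study, and the helper names `binary_metrics` and `roc_auc` are assumptions introduced here, not the authors' code.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {0, 1} (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"ACC": accuracy, "P": precision, "R": recall, "F1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]  # hypothetical classifier scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

The pairwise AUC above is quadratic in the number of examples, which is fine for a toy illustration; on a dataset the size of IMDB, a rank-based or library implementation would be the more practical choice.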

Keywords

Machine learning, movie review, sentiment analysis, text representation.

Ethical Statement

None

Supporting Institution

None

Project Number

None

Acknowledgments

None



Details

Primary Language	English
Subjects	Decision Support and Group Support Systems, Information Systems (Other)
Section	Design and Technology
Authors	Veysel Göç 0009-0008-9598-2786 Muhammet Sinan Başarslan 0000-0002-7996-9169
Project Number	None
Early View Date	November 21, 2024
Publication Date
Submission Date	June 9, 2024
Acceptance Date	October 6, 2024
Published in Issue	Year 2024, Early View

Cite

APA	Göç, V., & Başarslan, M. S. (2024). The Effect of Various Text Representation Methods for Sentiment Analysis on Movie Review Data with Different Machine Learning Methods. Gazi Üniversitesi Fen Bilimleri Dergisi Part C: Tasarım Ve Teknoloji, 1-1. https://doi.org/10.29109/gujsc.1498509

e-ISSN:2147-9526