The Effect of Various Text Representation Methods for Sentiment Analysis on Movie Review Data with Different Machine Learning Methods

Veysel Göç; Muhammet Sinan Başarslan

doi:10.29109/gujsc.1498509

Research Article

Year 2024, Early View, 1 - 1

Abstract

In this study, we examine how machine learning (ML) models perform with different text representation methods on the balanced IMDB dataset, which is widely regarded as a gold standard for sentiment analysis, one of the natural language processing (NLP) tasks. On the open-source IMDB movie reviews dataset, we first carry out the data preprocessing steps of data cleaning and text representation. We then perform sentiment classification with different ML models. To evaluate the models, we used accuracy (ACC), precision (P), recall (R), F1-score (F1), and area under the curve (AUC), together with the receiver operating characteristic (ROC) curve. Text feature extraction with Bidirectional Encoder Representations from Transformers (BERT) gave the highest performance in all models, with the support vector machine (SVM) model offering particularly promising results: ACC 0.9033, F1 0.9308, R 0.9015, P 0.9072, AUC 0.9638, and ROC 0.96. These findings suggest that NLP techniques, and in particular ML models that use BERT features, may offer high accuracy and reliability in text classification problems. Future studies could validate these findings by applying BERT to other NLP tasks, which would help to evaluate the effectiveness and applicability of the models in practice.
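The evaluation metrics named in the abstract can be illustrated with a minimal sketch. The function names `binary_metrics` and `roc_auc`, the toy labels, and the 0.5 decision threshold are assumptions made for illustration only; they are not taken from the study, and the numbers below are not the study's results.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive review)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true labels and classifier scores for eight reviews.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(binary_metrics(labels, predictions))
print(roc_auc(labels, scores))
```

Precision, recall, F1, and accuracy depend on the chosen threshold, whereas AUC is computed from the ranking of the scores alone, which is why it is reported separately.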

Keywords

Machine learning, movie review, sentiment analysis, text representation.

Ethical Statement

None

Supporting Institution

None

Project Number

None

Thanks

None



Details

Primary Language	English
Subjects	Decision Support and Group Support Systems, Information Systems (Other)
Journal Section	Design and Technology
Authors	Veysel Göç (ORCID 0009-0008-9598-2786), Muhammet Sinan Başarslan (ORCID 0000-0002-7996-9169)
Project Number	None
Early Pub Date	November 21, 2024
Publication Date
Submission Date	June 9, 2024
Acceptance Date	October 6, 2024
Published in Issue	Year 2024 Early View

Cite

APA	Göç, V., & Başarslan, M. S. (2024). The Effect of Various Text Representation Methods for Sentiment Analysis on Movie Review Data with Different Machine Learning Methods. Gazi Üniversitesi Fen Bilimleri Dergisi Part C: Tasarım ve Teknoloji, 1-1. https://doi.org/10.29109/gujsc.1498509

e-ISSN:2147-9526