Tweet Verileri İçin Metin Sınıflandırmasında Gelişmiş Makine Öğrenmesi Modelleri: CatBoost ve LightGBM ile Performans Karşılaştırması

Kamil Abdullah Eşidir

doi:10.54525/bbmd.1677261

Research Article

Year 2025, Volume: 18 Issue: 2, 112 - 123, 22.12.2025

Advanced Machine Learning Models for Text Classification of Tweet Data: Performance Comparison with CatBoost and LightGBM

Abstract

This study performs sentiment analysis as a binary classification problem on social media texts. The texts were passed through language preprocessing steps and encoded as sentence-level vectors with a multilingual BERT model. Class imbalance was addressed with SMOTE (Synthetic Minority Over-sampling Technique). LightGBM and CatBoost were used as the classifiers. Both models were evaluated with five-fold cross-validation, and performance metrics such as accuracy, F1 score, sensitivity, specificity, and ROC-AUC were computed. The visual analysis examined structural distributions such as text length, word count, and word clouds. Both models achieved high classification performance. CatBoost consistently outperformed LightGBM in accuracy (87.4%), F1 score (0.763), precision (0.737), and sensitivity (0.793); it stood out for recognizing the positive class better and for its balanced overall performance. The two models had the same ROC-AUC (0.926), indicating strong discrimination between the classes. The results suggest that advanced vectorization techniques, when combined with machine learning models, can produce effective results in sentiment analysis.
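The threshold-based metrics named in the abstract all derive from the four cells of a binary confusion matrix. The sketch below is a minimal illustration in plain Python, not the author's code: the names `Confusion`, `confusion_from_labels`, `f1_score`, and `metrics` are hypothetical, and the toy labels are invented for the example rather than taken from the study. ROC-AUC is left out because it requires predicted scores, not hard labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion-matrix counts; label 1 is the positive class."""
    tp: int
    fp: int
    tn: int
    fn: int


def confusion_from_labels(y_true, y_pred):
    """Count outcomes, assuming every label is exactly 0 or 1."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:  # truth == 1 and pred == 0 under the 0/1 assumption
            fn += 1
    return Confusion(tp, fp, tn, fn)


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (sensitivity)."""
    return safe_div(2 * precision * recall, precision + recall)


def metrics(c):
    """Derive the threshold-based metrics from confusion counts."""
    precision = safe_div(c.tp, c.tp + c.fp)
    sensitivity = safe_div(c.tp, c.tp + c.fn)  # also called recall
    return {
        "accuracy": safe_div(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": safe_div(c.tn, c.tn + c.fp),
        "f1": f1_score(precision, sensitivity),
    }


# Toy labels, invented for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
demo_confusion = confusion_from_labels(y_true, y_pred)
demo_metrics = metrics(demo_confusion)

# Consistency check on the rounded CatBoost values quoted in the abstract.
f1_from_reported = f1_score(0.737, 0.793)

print(demo_confusion)
print(demo_metrics)
print(round(f1_from_reported, 3))
```

Applied to the rounded precision (0.737) and sensitivity (0.793) quoted for CatBoost, the harmonic mean comes out at about 0.764, close to the reported F1 score of 0.763. A gap of this size would be expected from rounding or from averaging the metrics per fold, although the abstract does not say which applies.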

Keywords

Data mining, Text mining, Sentiment analysis, Machine learning, BERT

Acknowledgments

I thank the staff of the C and Systems Programmers Association (https://csystem.org/) and its President, Dr. Kaan ASLAN, for their guidance and feedback on the technical analyses, data preprocessing, and Python programming. Dr. Kaan ASLAN contributed in particular to data cleaning, model hyperparameter optimization, and performance evaluation, and offered suggestions for improving the accuracy of the study.

There are 21 citations in total.

Details

Primary Language	Turkish
Subjects	Business Process Management, Decision Support and Group Support Systems, Management Information Systems
Journal Section	Research Article
Authors	Kamil Abdullah Eşidir (ORCID: 0000-0002-8106-1758)
Submission Date	May 28, 2025
Acceptance Date	July 23, 2025
Early Pub Date	December 16, 2025
Publication Date	December 22, 2025
Published in Issue	Year 2025 Volume: 18 Issue: 2

Cite

IEEE	[1] K. A. Eşidir, “Tweet Verileri İçin Metin Sınıflandırmasında Gelişmiş Makine Öğrenmesi Modelleri: CatBoost ve LightGBM ile Performans Karşılaştırması”, Bilgisayar Bilimleri ve Mühendisliği Dergisi, vol. 18, no. 2, pp. 112–123, Dec. 2025, doi: 10.54525/bbmd.1677261.
