Conference Paper

COVID-19 ile İlgili Sosyal Medya Gönderilerinin Metin Madenciliği Yöntemlerine Dayalı Olarak Zaman-Mekansal Analizi

Year 2021, Issue: 26 - Ejosat Special Issue 2021 (HORA), 138 - 143, 31.07.2021
https://doi.org/10.31590/ejosat.957020

Abstract

COVID-19, hastalığın ilk bildirildiği dönemden bu yana, şiddetli akut solunum sendromu büyük salgınlara neden olmaktadır ve dünya çapında bir pandemiye dönüşmüştür. Dünyanın birçok ülkesinde, COVID-19 salgınının zaman-mekansal analizine yönelik olarak önemli sayıda gerçek zamanlı, etkileşimli mobil ya da çevrimiçi coğrafi bilgi sistemleri, web siteleri ve uygulamalar geliştirilmiştir. Bilgi ve iletişim teknolojilerindeki ilerlemeler ile pek çok farklı kaynaktan COVID-19 salgınına yönelik olarak elde edilen veriler, salgın durumuna ilişkin bilgilerin etkin ve zamanında elde edilebilmesi için büyük önem taşımaktadır. İnternetteki medya ve iletişim platformlarında paylaşılan haber makaleleri, bulaşıcı hastalık salgınlarının izlenmesi ve takip edilmesi için önemli bir veri kaynağı niteliğindedir. Bu çalışmada, İngiltere ve İspanya’da COVID-19 sürecine ilişkin 2020 yılının mart, mayıs ve temmuz aylarında yayınlanan 299’ar tane haber makalesi toplanarak oluşturulan derlem kullanılmaktadır. Metin belgelerinin temsilinde, üç temel n-gram modeli olan (1-gram, 2-gram ve 3-gram) temsilleri, tümce ögeleri 2-gram ve tümce ögeleri 3-gram öznitelikleri, kelime/tümce ögesi çiftleri, karakter n-gram (n=2) ve karakter n-gram (n=3) öznitelikleri ve bu özniteliklerin bir araya getirilmesi ile elde edilen topluluk öznitelik kümelerinin etkinlikleri değerlendirilmektedir. Öznitelik kümelerinin başarımlarının değerlendirilmesinde, altı temel makine öğrenmesi sınıflandırıcısı olan Naive Bayes algoritması, lojistik regresyon algoritması, destek vektör makineleri, C4.5 karar ağacı, k-en yakın komşu algoritması ve rastgele orman algoritması kullanılmaktadır. Deneysel analizlerde kullanılan on yedi farklı metin temsil yöntemi arasında en yüksek başarımın, sözcük tabanlı 1-gram özniteliklerin karakter tabanlı 3-gram modeli ile kullanıldığında elde edildiği görülmektedir. Deneysel analizlerde kullanılan temel sınıflandırma algoritmaları arasında en yüksek başarım rastgele orman algoritmasıyla, ikinci en yüksek başarım ise lojistik regresyon algoritmasıyla alınmaktadır. Deneysel analizler, makine öğrenmesi ve metin madenciliği tekniklerinin, salgın hastalıklara ilişkin sosyal medya gönderilerinin zaman-mekânsal analizi için uygun teknikler olduğunu göstermektedir.

References

  • Chawla, S., Mittal, M., Chawla, M., & Goyal, L. M. (2020). Corona virus-SARS-CoV-2: an insight to another way of natural disaster. EAI Endorsed Transactions on Pervasive Health and Technology, 6(22).
  • Wang, L. L., & Lo, K. (2021). Text mining approaches for dealing with the rapidly expanding literature on COVID-19. Briefings in Bioinformatics, 22(2), 781-799.
  • Gajewski, K. N., Peterson, A. E., Chitale, R. A., Pavlin, J. A., Russell, K. L., & Chretien, J. P. (2014). A review of evaluations of electronic event-based biosurveillance systems. PLoS ONE, 9(10), e111222.
  • Onan, A., Korukoğlu, S., & Bulut, H. (2016). Ensemble of keyword extraction methods and classifiers in text classification. Expert Systems with Applications, 57, 232-247.
  • Onan, A. (2016). Classifier and feature set ensembles for web page classification. Journal of Information Science, 42(2), 150-165.
  • Onan, A., Korukoğlu, S., & Bulut, H. (2016). A multiobjective weighted voting ensemble classifier based on differential evolution algorithm for text sentiment classification. Expert Systems with Applications, 62, 1-16.
  • Onan, A., & Korukoğlu, S. (2017). A feature selection model based on genetic rank aggregation for text sentiment classification. Journal of Information Science, 43(1), 25-38.
  • Onan, A. (2017). Hybrid supervised clustering based ensemble scheme for text classification. Kybernetes.
  • Onan, A., & Toçoğlu, M. A. (2021). A term weighted neural language model and stacked bidirectional LSTM based framework for sarcasm identification. IEEE Access, 9, 7701-7722.
  • Onan, A., Korukoğlu, S., & Bulut, H. (2017). A hybrid ensemble pruning approach based on consensus clustering and multi-objective evolutionary algorithm for sentiment classification. Information Processing & Management, 53(4), 814-833.
  • Toçoğlu, M. A., & Onan, A. (2019, August). Satire detection in Turkish news articles: a machine learning approach. In International Conference on Big Data Innovations and Applications (pp. 107-117). Springer, Cham.
  • Onan, A. (2018, May). Review spam detection based on psychological and linguistic features. In 2018 26th Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). IEEE.
  • Onan, A. (2018). An ensemble scheme based on language function analysis and feature engineering for text genre classification. Journal of Information Science, 44(1), 28-47.
  • Jahanbin, K., & Rahmanian, V. (2020). Using Twitter and web news mining to predict COVID-19 outbreak. Asian Pacific Journal of Tropical Medicine, 13(8), 378.
  • Ordun, C., Purushotham, S., & Raff, E. (2020). Exploratory analysis of covid-19 tweets using topic modeling, umap, and digraphs. arXiv preprint arXiv:2005.03082.
  • Peng, Z., Wang, R., Liu, L., & Wu, H. (2020). Exploring urban spatial features of COVID-19 transmission in Wuhan based on social media data. ISPRS International Journal of Geo-Information, 9(6), 402.
  • Li, D., Chaudhary, H., & Zhang, Z. (2020). Modeling spatiotemporal pattern of depressive symptoms caused by COVID-19 using social media data mining. International Journal of Environmental Research and Public Health, 17(14), 4988.
  • Chen, N., Zhong, Z., & Pang, J. (2021). An exploratory study of COVID-19 information on twitter in the greater region. Big Data and Cognitive Computing, 5(1), 5.
  • Boon-Itt, S., & Skunkan, Y. (2020). Public perception of the COVID-19 pandemic on Twitter: sentiment analysis and topic modeling study. JMIR Public Health and Surveillance, 6(4), e21978.
  • Onan, A. (2017). Twitter mesajları üzerinde makine öğrenmesi yöntemlerine dayalı duygu analizi. Yönetim Bilişim Sistemleri, 3(2), 1-14.
  • Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
  • Roche, M. (2020). COVID-19 and Media datasets: Period- and location-specific textual data mining. Data in Brief, 33, 106356.

Spatio-Temporal Analysis of Social Media Posts Related to COVID-19 Based on Text Mining Methods

Abstract

Since it was first reported, COVID-19, the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has caused large outbreaks and grown into a worldwide pandemic. In many countries, a significant number of real-time, interactive mobile or online geographic information systems, websites and applications have been developed for the spatio-temporal analysis of the COVID-19 outbreak. With advances in information and communication technologies, data on the outbreak obtained from many different sources are of great importance for gaining effective and timely information on the epidemic situation. News articles shared on Internet media and communication platforms are an important data source for monitoring and tracking infectious disease outbreaks. This study uses a corpus compiled from 299 news articles each on the COVID-19 process in England and Spain, published in March, May and July 2020. Text documents are represented with the three basic word n-gram models (1-gram, 2-gram and 3-gram), part-of-speech 2-gram and 3-gram features, word/part-of-speech pairs, and character n-gram features (n = 2 and n = 3); the effectiveness of these feature sets and of the ensemble feature sets obtained by combining them is evaluated. Six basic machine learning classifiers are used to evaluate the feature sets: Naive Bayes, logistic regression, support vector machines, the C4.5 decision tree, k-nearest neighbor and random forest. Among the seventeen text representation methods compared in the experiments, the highest performance is achieved when word-based 1-gram features are combined with the character-based 3-gram model. Among the classification algorithms, random forest yields the highest performance and logistic regression the second highest. The experiments indicate that machine learning and text mining techniques are suitable for the spatio-temporal analysis of social media posts about epidemics.
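
To make the best-performing representation more concrete, the following is a minimal sketch of how word 1-gram and character 3-gram counts might be merged into one feature set. It is not the author's implementation: the function names, the whitespace tokenization, the lowercasing and the "w:"/"c:" key prefixes are all assumptions made for illustration, and the sample sentence is invented rather than taken from the corpus.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return word n-grams (tokens joined by spaces) from lowercased text.

    Tokenization by whitespace is an assumption, not the paper's method.
    """
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return character n-grams (spaces included) from lowercased text."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def combined_features(text, word_n=1, char_n=3):
    """Count word n-grams and character n-grams in one feature dictionary.

    Keys are prefixed ("w:" / "c:") so the two feature families never collide.
    """
    features = Counter()
    for gram in word_ngrams(text, word_n):
        features["w:" + gram] += 1
    for gram in char_ngrams(text, char_n):
        features["c:" + gram] += 1
    return features


if __name__ == "__main__":
    sample = "Spain eases lockdown"  # hypothetical headline, not from the corpus
    feats = combined_features(sample)
    print(sorted(feats)[:5])
```

Count dictionaries of this kind would then be vectorized and passed to a classifier such as random forest; that step is omitted here because the abstract does not describe the weighting or preprocessing used.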

There are 23 citations in total.

Details

Primary Language: Turkish
Subjects: Engineering
Journal Section: Articles
Authors: Aytuğ Onan (ORCID: 0000-0002-9434-5880)

Publication Date: July 31, 2021
Published in Issue: Year 2021, Issue: 26 - Ejosat Special Issue 2021 (HORA)

Cite

APA: Onan, A. (2021). COVID-19 ile İlgili Sosyal Medya Gönderilerinin Metin Madenciliği Yöntemlerine Dayalı Olarak Zaman-Mekansal Analizi. Avrupa Bilim ve Teknoloji Dergisi, (26), 138-143. https://doi.org/10.31590/ejosat.957020