Disease Detection From Twitter Data Using Natural Language Processing and Machine Learning
Year 2020, 839 - 852, 01.12.2020
Ali Öztürk, Üsame Durak, Fatma Badıllı
Abstract
In this study, we determined whether messages posted by Twitter users were about a disease and, if so, which disease. For this purpose, supervised and unsupervised machine learning algorithms were tested and compared using features extracted with the TF-IDF (term frequency-inverse document frequency) and BOW (bag-of-words) methods. Data were collected from Twitter with Python scripts. The algorithms were implemented with Scikit-Learn, a machine learning library for Python. The clustering algorithms, which are unsupervised methods, achieved an accuracy of 68.60%, while the supervised classification algorithms reached an accuracy of 97.48%.
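As a minimal sketch of the supervised setup described in the abstract, the following compares BOW and TF-IDF features with Scikit-Learn. The toy tweets, the label names, the `build_model` helper and the choice of a multinomial naive Bayes classifier are all assumptions made for illustration; the abstract does not say which classifiers or data were used.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy tweets; the study's real data set is not reproduced here.
tweets = [
    "I have had the flu all week with a high fever",
    "flu season again and my fever will not go down",
    "my migraine is back and the headache is unbearable",
    "another migraine today, this headache is killing me",
    "great football match tonight, what a goal",
    "watching the football final with friends",
]
# Hypothetical labels: a disease name, or "none" for tweets unrelated to disease.
labels = ["flu", "flu", "migraine", "migraine", "none", "none"]


def build_model(vectorizer):
    """Chain a text vectorizer with a classifier (naive Bayes is an assumed choice)."""
    return make_pipeline(vectorizer, MultinomialNB())


# One model per feature extraction method, so the two can be compared.
models = {
    "bow": build_model(CountVectorizer()),    # raw term counts
    "tfidf": build_model(TfidfVectorizer()),  # counts reweighted by inverse document frequency
}

new_tweet = ["flu and fever again"]
predictions = {}
for name, model in models.items():
    model.fit(tweets, labels)
    predictions[name] = model.predict(new_tweet)[0]

print(predictions)
```

On a real data set the models would be scored on held-out tweets, for example with `sklearn.model_selection.cross_val_score`, rather than on a single hand-written example.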
DISEASE DETECTION FROM TWITTER DATA WITH NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING
Abstract
In this study, we determined whether messages written by Twitter users were about a disease and identified the types of disease. To this end, supervised and unsupervised machine learning algorithms were tried with features extracted by the TF-IDF and BOW methods, and the results were compared. The data were collected from Twitter with Python scripts. The Scikit-Learn library, developed for Python, was used to implement the algorithms. Unsupervised clustering of the data achieved a success rate of 68.60%, while classification with supervised algorithms reached a success rate of 97.48%.
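Reporting an accuracy for clustering requires some way of matching clusters to disease labels, which the abstract does not describe. The sketch below shows one common option, assumed here for illustration: cluster TF-IDF vectors with k-means, assign each cluster its majority true label, and count the matching tweets. The toy tweets, the `clustering_accuracy` helper, and the choice of k-means with three clusters are hypothetical.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy tweets and labels; not the study's data.
tweets = [
    "I have had the flu all week with a high fever",
    "flu season again and my fever will not go down",
    "my migraine is back and the headache is unbearable",
    "another migraine today, this headache is killing me",
    "great football match tonight, what a goal",
    "watching the football final with friends",
]
labels = ["flu", "flu", "migraine", "migraine", "none", "none"]


def clustering_accuracy(true_labels, cluster_ids):
    """Label each cluster with its majority true label, then return the
    fraction of items whose true label matches their cluster's label."""
    correct = 0
    for cluster in set(cluster_ids):
        members = [t for t, c in zip(true_labels, cluster_ids) if c == cluster]
        # Size of the largest label group inside this cluster.
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(true_labels)


# Cluster the TF-IDF vectors without looking at the labels.
features = TfidfVectorizer().fit_transform(tweets)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(features)

# The labels are used only afterwards, to score the clustering.
accuracy = clustering_accuracy(labels, cluster_ids)
print(f"clustering accuracy: {accuracy:.2f}")
```

Majority-label mapping is optimistic, because several clusters may map to the same label; a one-to-one assignment between clusters and labels is a stricter alternative.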