Application of Natural Language Processing with Supervised Machine Learning Techniques to Predict the Overall Drugs Performance

Pius Marthın; Duygu İçen

doi:10.5824/ajite.2020.01.001.x

Research Article

Application of Natural Language Processing with Supervised Machine Learning Methods to Predict Overall Drug Performance

Year 2020, Volume: 11 Issue: 40, 8 - 23, 03.05.2020


https://izlik.org/JA32LT32SS

Abstract

Online product reviews have become a valuable source of information that makes it easier for customers to decide about a particular product. To improve the quality of their products, pharmaceutical companies use online drug reviews, which are rich in information about users' satisfaction and experiences with a particular drug. Machine learning enables scientists to develop more efficient models that support decision making in various fields. In this article we consider a drug review dataset from the UCI machine learning repository website, used by Gräßer, Kallumadi, Malberg and Zaunseder (2018). Our aim is to identify the machine learning model that best predicts overall drug performance from users' reviews. Besides various manipulations made to improve model accuracy, all procedures required for text analysis were followed, including text cleaning and conversion of the texts to numeric format so that machine learning models could be applied. Before modeling, we obtained overall sentiment scores for the customers' reviews of the drugs. The customers' comments were summarized and visualized using a bar chart and a word cloud to discover the most frequently used terms. We randomly selected 15000 observations from the training data of 161297 observations, and 10000 observations from the test data of 53766 observations. Several machine learning models were trained using 10-fold cross-validation performed under stratified random sampling. The models used for training were Classification and Regression Trees (CART), the C5.0 algorithm, logistic regression (GLM), Multivariate Adaptive Regression Splines (MARS), support vector machines (SVM) with both radial and linear kernels, and Random Forest. Model selection was based on a comparison of accuracy and computational efficiency. The support vector machine (SVM) with a linear kernel gave significantly the best prediction results, with 83% accuracy, compared with the others. Using only a small part of the dataset, we achieved reasonable accuracy in our models by applying the TF-IDF transformation and Latent Semantic Analysis (LSA) to our TDM.
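The sampling and cross-validation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the class labels, the helper names (`group_by_class`, `stratified_sample`, `stratified_folds`), and the toy sizes are all assumptions made for illustration, and only the Python standard library is used.

```python
import random
from collections import Counter, defaultdict


def group_by_class(labels):
    """Map each class label to the list of row indices carrying it."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    return groups


def stratified_sample(labels, n, seed=0):
    """Draw roughly n indices, keeping each class at its original share."""
    rng = random.Random(seed)
    chosen = []
    for label, members in sorted(group_by_class(labels).items()):
        quota = round(n * len(members) / len(labels))
        chosen.extend(rng.sample(members, quota))
    return sorted(chosen)


def stratified_folds(labels, k=10, seed=0):
    """Deal the indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label, members in sorted(group_by_class(labels).items()):
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds


# Hypothetical toy labels: 700 "high" and 300 "low" performance reviews.
labels = ["high"] * 700 + ["low"] * 300
sample = stratified_sample(labels, n=100)
sample_labels = [labels[i] for i in sample]
folds = stratified_folds(sample_labels, k=10)

for fold in folds:
    # Each fold serves once as the validation set; the other nine train the model.
    print(Counter(sample_labels[i] for i in fold))
```

With these toy labels every fold holds 7 "high" and 3 "low" reviews, so each validation fold mirrors the 70/30 class balance of the full sample.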

Keywords

Term Document Matrix, Machine Learning, Sentiment Analysis, Cross-Validation, Term Frequency-Inverse Document Frequency, Latent Semantic Analysis

References

Bhargava, A. (2019). Grouping of Medicinal Drugs Used for Similar Symptoms by Mining Clusters from Drug Benefits Reviews. Available at SSRN: https://ssrn.com or http://dx.doi.org/10.2139/ssrn.3356314
Denecke, K., & Deng, Y. (2015). Sentiment analysis in medical settings: new opportunities and challenges. Artificial Intelligence in Medicine, 64(1), 17–27.
Gräßer, F., Kallumadi, S., Malberg, H., & Zaunseder, S. (2018). Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. Proceedings of the 2018 International Conference on Digital Health - DH ’18. doi:10.1145/3194658.3194677. Dataset: https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29
IBM Corporation. (2013). Data-driven healthcare organizations use big data analytics for big gains. Somers, NY: IBM Corporation.
Jiménez-Zafra, S. M., Martín-Valdivia, M. T., & Ureña-López, L. A. (2019). How do we talk about doctors and drugs? Sentiment analysis in forums expressing opinions for the medical domain. Artificial Intelligence in Medicine, 93, 50–57. doi:10.1016/j.artmed.2018.03.007
Denecke, K. (2015). Sentiment Analysis from Medical Texts. Springer International Publishing, Cham, 83–98. https://doi.org/10.1007/978-3-319-20582-3_10
Kho, S. J., Padhee, S., Bajaj, G., Thirunarayan, K., & Sheth, A. (2019). Domain-Specific Use Cases for Knowledge-Enabled Social Media Analysis. In: Agarwal, N., Dokoohaki, N., & Tokdemir, S. (eds) Emerging Research Challenges and Opportunities in Computational Social Network Analysis and Mining. Lecture Notes in Social Networks. Springer, Cham.
Korkontzelos, I., Nikfarjam, A., Shardlow, M., Sarker, A., Ananiadou, S., & Gonzalez, G. (2016). Analysis of the effect of sentiment analysis on extracting adverse drug reactions from tweets and forum posts. Journal of Biomedical Informatics, 62. doi:10.1016/j.jbi.2016.06.007
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259–284.
Liu, C., Sheng, Y., Wei, Z., & Yang, Y.-Q. (2018). Research of Text Classification Based on Improved TF-IDF Algorithm. 2018 IEEE International Conference of Intelligent Robotic and Control Engineering (IRCE). doi:10.1109/irce.2018.8492945
Liu, B., & Zhang, L. (2012). A Survey of Opinion Mining and Sentiment Analysis. In: Aggarwal, C., & Zhai, C. (eds) Mining Text Data. Springer, Boston, MA.
Luo, C., Zhan, J., Xue, X., Wang, L., Ren, R., & Yang, Q. (2018). Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., & Maglogiannis, I. (eds) Artificial Neural Networks and Machine Learning – ICANN 2018. Lecture Notes in Computer Science, vol 11139. Springer, Cham.
Pirmohamed, M., James, S., Meakin, S., Green, C., Scott, A. K., Walley, T. J., Farrar, K., Park, B. K., & Breckenridge, A. M. (2004). Adverse drug reactions as a cause of admission to hospital: prospective analysis of 18820 patients. BMJ, 329(7456), 15–19. doi:10.1136/bmj.329.7456.15
Mohammad, S., & Turney, P. (2010). Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon. In Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, LA, California.
Al-Moslmi, T., Omar, N., Abdullah, S., & Albared, M. (2017). Approaches to Cross-Domain Sentiment Analysis: A Systematic Literature Review. IEEE Access, 5, 16173–16192. doi:10.1109/ACCESS.2017.2690342


Abstract

Online product reviews have become a valuable source of information that facilitates customer decisions about a particular product. Because these reviews carry rich information about users' satisfaction and experiences with a particular drug, pharmaceutical companies use online drug reviews to improve the quality of their products. Machine learning has enabled scientists to train more efficient models that facilitate decision making in various fields. In this manuscript we used the drug review dataset of Gräßer, Kallumadi, Malberg, & Zaunseder (2018), freely available from the machine learning repository website of the University of California, Irvine (UCI), to identify the machine learning model that best predicts overall drug performance from users' reviews. Apart from several manipulations done to improve model accuracy, all procedures required for text analysis were followed, including text cleaning and transformation of the texts to numeric format so that machine learning models could be trained. Prior to modeling, we obtained overall sentiment scores for the reviews. Customers' reviews were summarized and visualized using a bar plot and a word cloud to explore the most frequent terms. Due to scalability issues, we were able to use only a sample of the dataset: we randomly sampled 15000 observations from the 161297 observations in the training dataset and 10000 observations from the 53766 observations in the testing dataset. Several machine learning models were trained using 10-fold cross-validation performed under stratified random sampling. The trained models include Classification and Regression Trees (CART), a C5.0 classification tree, logistic regression (GLM), Multivariate Adaptive Regression Splines (MARS), support vector machines (SVM) with both radial and linear kernels, and a random forest of classification trees (Random Forest). Model selection was done through a comparison of accuracies and computational efficiency. The support vector machine (SVM) with a linear kernel performed significantly better than the rest, with an accuracy of 83%. Using only a small portion of the dataset, we managed to attain reasonable accuracy in our models by applying the TF-IDF transformation and the Latent Semantic Analysis (LSA) technique to our term-document matrix (TDM).
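To make the TF-IDF transformation of a term-document matrix concrete, here is a minimal sketch in plain Python. The abstract does not state which TF-IDF variant was used, so the weighting below (term frequency normalized by document length, times the log of the inverse document frequency) is an assumption, as are the toy reviews and the function names `term_document_matrix` and `tf_idf`. The subsequent LSA step, a truncated singular value decomposition of the weighted matrix, is not shown.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix) with one row per term and one column per document."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    tdm = [[c[term] for c in counts] for term in vocab]
    return vocab, tdm


def tf_idf(tdm):
    """Weight raw counts by tf * idf, with tf = count / document length
    and idf = log(number of documents / number of documents containing the term)."""
    n_docs = len(tdm[0])
    doc_len = [sum(row[j] for row in tdm) for j in range(n_docs)]
    weighted = []
    for row in tdm:
        df = sum(1 for count in row if count > 0)
        idf = math.log(n_docs / df)
        weighted.append([count / doc_len[j] * idf for j, count in enumerate(row)])
    return weighted


# Hypothetical toy reviews, assumed to be already cleaned and lower-cased.
docs = [
    "the drug worked well",
    "the drug caused headache",
    "the pain stopped",
]
vocab, tdm = term_document_matrix(docs)
weights = tf_idf(tdm)

for term, row in zip(vocab, weights):
    print(f"{term:10s}", [round(w, 3) for w in row])
```

In this toy example the word "the" occurs in every review and so receives a weight of zero, while words confined to a single review, such as "headache", receive the largest weights. That down-weighting of uninformative terms is the reason for applying TF-IDF before a dimensionality reduction such as LSA.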

Keywords

Term Document Matrix (TDM), Machine Learning, Sentiment Analysis, Cross-Validation, Term Frequency-Inverse Document Frequency (TF-IDF), Latent Semantic Analysis (LSA)



Details

Primary Language	English
Journal Section	Research Article
Authors	Pius Marthın (ORCID 0000-0003-3529-0311), Duygu İçen (ORCID 0000-0002-7940-5064)
Submission Date	February 26, 2020
Publication Date	May 3, 2020
DOI	https://doi.org/10.5824/ajite.2020.01.001.x
IZ	https://izlik.org/JA32LT32SS
Published in Issue	Year 2020 Volume: 11 Issue: 40

Cite

APA	Marthın, P., & İçen, D. (2020). Application of Natural Language Processing with Supervised Machine Learning Techniques to Predict the Overall Drugs Performance. AJIT-E: Academic Journal of Information Technology, 11(40), 8-23. https://doi.org/10.5824/ajite.2020.01.001.x


Articles in this journal are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.