Application of Natural Language Processing with Supervised Machine Learning Techniques to Predict the Overall Drugs Performance

Online product reviews have become a valuable source of information that facilitates customer decisions about a particular product. Given the wealth of information they contain about users' satisfaction and experiences with a particular drug, pharmaceutical companies use online drug reviews to improve the quality of their products. Machine learning has enabled scientists to train more efficient models that facilitate decision making in various fields. In this manuscript we use the drug review dataset of (Gräßer, Kallumadi, Malberg, & Zaunseder, 2018), freely available from the machine learning repository website of the University of California Irvine (UCI), to identify the machine learning model that best predicts overall drug performance from users' reviews. Apart from several manipulations done to improve model accuracy, all necessary text analysis procedures were followed, including text cleaning and transformation of the text to a numeric format suitable for training machine learning models. Prior to modeling, we obtained overall sentiment scores for the reviews. Customers' reviews were summarized and visualized using a bar plot and a word cloud to explore the most frequent terms. Due to scalability issues, we were able to use only a sample of the dataset: we randomly sampled 15000 observations from the 161297 in the training dataset and 10000 observations from the 53766 in the testing dataset. Several machine learning models were trained using 10-fold cross-validation performed under stratified random sampling. The trained models include Classification and Regression Trees (CART), a classification tree by C5.0, logistic regression (GLM), Multivariate Adaptive Regression Splines (MARS), Support Vector Machines (SVM) with both radial and linear kernels, and a classification model using random forest (Random Forest). Model selection was done through a comparison of accuracies and computational efficiency.
The random forest attained the highest accuracy (84%), but the SVM with a linear kernel, with an accuracy of 83%, was selected for its simplicity and computational efficiency. Using only a small portion of the dataset, we managed to attain reasonable accuracy in our models by applying the TF-IDF transformation and the Latent Semantic Analysis (LSA) technique to our term-document matrix (TDM).


INTRODUCTION
Pharmaceutical companies ensure the safety of their products mostly through clinical trials and specific test protocols used to test drug effectiveness. Due to the limited number of test subjects and the limited time span, high variation and bias in patient selection may be inevitable in such studies (Gräßer et al., 2018). Consequently, drug effectiveness may be significantly affected and unexpected adverse drug reactions may occur.
According to the study conducted by (Pirmohamed, James, Meakin, Green, Scott, Walley, Farrar, Park, & Breckenridge), Adverse Drug Reactions (ADRs) are one of the major public health issues and one of the leading causes of morbidity and mortality. Korkontzelos, Nikfarjam, Shardlow, Sarker, Ananiadou, & Gonzalez (2016) found that although the efficiency and safety of drugs are tested during clinical trials, many ADRs remain latent and may only be revealed in specific cases: after long-term use, when used in combination with other drugs, or when used by patients who were excluded from the trials, such as adults with other morbidities, children, the elderly, or pregnant women. Therefore, systematic drug reviews that aggregate the available information in a neutral manner are essential for raising customer satisfaction, achieving business objectives, and improving community health in general. Procedures that lead to ideal personalized treatment options for a given patient and time depend specifically on structured data (Gräßer et al., 2018). Such data are often limited because they require intense preparation that is not usual in clinical routine, and therefore other sources of information such as user reviews are in great demand (Gräßer et al., 2018). With the rapid growth of social media on the Web, individuals and organizations are increasingly using public opinions in these media for their decision making (Liu and Zhang, 2012). Although accessing all the important data in an unstructured source is a challenge, exposing the information embedded in these sources can significantly increase healthcare practitioners' knowledge of the patient (IBM Corporation, 2013).
Sentiment analysis of opinions presented via medical platforms is of significant use in decision making concerning public health (Gräßer et al., 2018). Positive and negative effects of a treatment can be assessed for clinical evidence; relations between symptoms, lifestyle, and effectiveness can also be studied (Gräßer et al., 2018). Information on the health status and psychological status of a patient can be collected, for example, by analyzing information generated within a patient-doctor social network (Kerstin, 2015).
In texts obtained from other social networks, opinions may be conveyed through facts that are interpretable by the emotions they convey (Denecke and Deng 2015). This is comparable to sentiment analysis in the healthcare domain, where sentiments are expressed through diseases, treatments, or medical conditions and their impact on a patient's quality of life and health status (Denecke and Deng 2015). However, users of online medical platforms express their views in a distinctive manner, as the language used to comment on a particular drug or medication differs considerably from that of other common domains such as sports or business. This limits the applicability of sentiment analysis based on typical lexicons.

AJIT-e: Bilişim Teknolojileri Online
In most of the existing literature, sentiment is taken as polarity, i.e. positive, negative, or neutral polarity towards some subject (Denecke and Deng 2015). In contrast to products or persons, where sentiment mainly consists of like or dislike, sentiments towards medications, treatments, or even diagnoses have more facets and are expressed in different words (Denecke and Deng 2015).
Accordingly, alternative procedures that treat the problem as either classification or regression may be carried out, and machine learning can be applied to provide possible solutions. Machine learning techniques can appropriately be used to train classifiers on domain-specific datasets to detect polarity at the sentence or document level and to perform sentiment analysis over multiple facets of an issue (Gräßer et al., 2018).
Using machine learning as an alternative remedy, several studies have analyzed online drug reviews from different medical platforms. These include, but are not limited to, the following: (Jimene, Martín, & Urena, 2019) applied supervised learning and a lexicon-based sentiment analysis approach to two different corpora extracted from the social web and focused specifically on drugs and doctors; (Kho, Padhee, Bajaj, Thirunarayan, & Sheth, 2019) discussed the need to go beyond data-driven machine learning and natural language processing and to incorporate deep domain knowledge; (Bhargava, 2019) applied the k-means clustering algorithm to a textual dataset of unlabeled reviews of medicinal drugs in order to group drugs with similar usage and benefits; and (Gräßer et al., 2018) performed multiple tasks on drug reviews obtained by crawling online pharmaceutical review sites (the same dataset used in this paper), applying sentiment analysis to predict the sentiments concerning overall satisfaction, side effects, and effectiveness in user reviews of specific drugs.
In this manuscript, we use the drug review dataset of (Gräßer et al., 2018), freely available from the machine learning repository website of the University of California Irvine (UCI), to perform sentiment analysis on drug reviews in order to identify the machine learning model that best predicts overall drug performance from users' reviews. We apply Latent Semantic Analysis (LSA) to select a few of the most important predictive features and penalize the most frequent terms using the TF-IDF transformation, so as to attain reasonable accuracy for our models in the most efficient manner using only a portion of the dataset.
The rest of the manuscript is organized as follows: in section 2 we discuss the dataset used to train our machine learning models, section 3 covers material and methods, section 4 includes

DATASET
The drug dataset was created by (Gräßer et al., 2018) and is freely available from the machine learning repository website of the University of California Irvine (UCI). The downloaded text files contain both training and testing datasets. The datasets consist of 6 features: drug name, patient condition, patient review (text), rating (10-star patient rating), review date, and the number of users who found the review useful (for more information about the dataset, please see the machine learning repository website of the University of California Irvine (UCI) at the link provided in the reference list).
Due to scalability issues, we are able to use only a sample of the dataset. We randomly sample 15000 observations from the 161297 in the training dataset and 10000 observations from the 53766 in the testing dataset. For the purpose of this study, we select two features: reviews (text) and ratings. We create our binary target variable, which represents overall drug performance, by converting ratings into a factor and redefining its levels as high if the patient rating is 6 stars or higher and low otherwise.
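The sampling and target construction described above can be sketched as follows. This is a minimal illustration in Python, not the code used in the study; the function names, the seed, and the toy rows are assumptions introduced here, and only the threshold of 6 stars comes from the text.

```python
import random


def label_performance(rating, threshold=6):
    """Map a 1-10 star rating to the binary target: 'high' if rating >= 6, else 'low'."""
    return "high" if rating >= threshold else "low"


def sample_reviews(rows, n, seed=0):
    """Draw n rows without replacement; the seed is an arbitrary choice for reproducibility."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


# Toy (review text, rating) pairs standing in for the real dataset.
rows = [("worked well", 9), ("no effect", 3), ("some relief", 6), ("bad rash", 1)]
sampled = sample_reviews(rows, 3)
labelled = [(text, label_performance(rating)) for text, rating in sampled]
```

With the real data, `n` would be 15000 for the training set and 10000 for the testing set.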

MATERIAL and METHODS
All necessary text analytics procedures are applied to clean the corpus (the collection of reviews), including removal of URLs, stop words, punctuation, and white space, conversion to lowercase, stemming, and finally conversion to a term-document matrix (TDM). We also obtain a data frame consisting of terms (words) with their respective frequencies, to be used for the word cloud and the bar plot of the most frequent terms in the corpus, together with other necessary computations such as the term frequency-inverse document frequency (TF-IDF) transformation.
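The cleaning steps listed above can be illustrated with a small standard-library sketch. The stop-word list and the suffix-stripping stemmer below are deliberately tiny stand-ins assumed for illustration (a real pipeline would use a full stop-word list and a proper stemmer such as Porter's), and the function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the study).
STOP_WORDS = {"the", "a", "an", "and", "i", "it", "is", "was", "to", "of", "for", "my"}


def naive_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean(review):
    """Lowercase, drop URLs and punctuation, collapse white space, drop stop words, stem."""
    text = review.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # split() also trims white space
    return [naive_stem(t) for t in tokens]


def term_document_matrix(reviews):
    """Return (vocabulary, matrix) where rows are terms and columns are documents."""
    docs = [Counter(clean(r)) for r in reviews]
    vocab = sorted(set().union(*docs))
    matrix = [[doc[term] for doc in docs] for term in vocab]
    return vocab, matrix


vocab, tdm = term_document_matrix(["Headaches stopped.", "The headache returned"])
```

Summing each row of `tdm` gives the term frequencies that feed a bar plot or word cloud of the most frequent terms.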
For the feature space, both unigrams and bi-grams are considered, but bi-grams yield no improvement in model accuracy, and we therefore rely entirely on unigram models. Afterward, we engineer two new features based on review length and cosine similarity, respectively. The review length feature does not improve our models' accuracy and hence is not used.
The cosine similarity feature is computed under the hypothesis that low-rated drugs have low cosine similarities with highly rated drugs, and vice versa. Cosine similarity measures similarity as the cosine of the angle between two vectors.
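Given two vectors A and B, the cosine similarity is given by Equation 1:

$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (1)$$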

In the equation above, the numerator is the usual dot product and the denominator is the product of the Euclidean norms (magnitudes) of the two vectors, as suggested in the study conducted by (Luo, Zhan, Xue, Wang, Ren, & Yang, 2018). This formula is applied to our term frequency (TF) matrix to compute the new predictor.
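As a brief illustration, the cosine similarity of two term-frequency vectors can be computed as below. The function name and the convention of returning 0 for an all-zero vector are assumptions made here; the text does not specify how the pairwise similarities are aggregated into a single predictor, so that step is not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for an empty (all-zero) review vector
    return dot / (norm_a * norm_b)


# Two toy term-frequency vectors over the same vocabulary.
review_a = [1, 2, 0]
review_b = [2, 4, 0]
similarity = cosine_similarity(review_a, review_b)
```

Because `review_b` is a scaled copy of `review_a`, the two vectors point in the same direction and their similarity is 1 up to floating-point rounding; vectors with no terms in common give 0.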
We then apply Latent Semantic Analysis (LSA) to extract the 300 most influential features. LSA is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. It is an information retrieval technique that analyzes an unstructured collection of text and identifies the patterns in it and the relationships between them through singular value decomposition (SVD) (Landauer, Foltz, & Laham, 1998).
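The decomposition behind LSA can be written out as a brief sketch. The symbols below (X for the m × n term-document matrix with m terms and n reviews, k for the number of retained dimensions) are notation assumed here for illustration; only the use of SVD and the choice of 300 features come from the text.

```latex
X = U \Sigma V^{\top}, \qquad
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad k = 300
```

Here U_k and V_k hold the first k left and right singular vectors and Σ_k is the diagonal matrix of the k largest singular values. Under this notation, review j (column j of X) is represented by row j of V_k Σ_k, a vector of k values that can serve as the predictors for that review.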
Further, we standardize our term-document matrix (TDM) through the term frequency-inverse document frequency (TF-IDF) transformation in order to penalize the most frequent terms. TF-IDF scores are computed as the product of term frequency (TF) and inverse document frequency (IDF) for specific words. Therefore, the score of any word in any document (review) can be

RESULTS
Prior to modeling, an unsupervised machine learning approach using a high-quality, moderate-sized emotion lexicon developed by Saif Mohammad and Peter Turney (2010) is conducted to obtain overall sentiment scores for the reviews of the drug as summarized in sentiments. Further, the score for negative sentiment is higher than that of positive sentiment.
Due to limitations of sentiment analysis on medical reviews as discussed in section 1, we cannot generalize on drug performance strictly based on sentiment scores.
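Lexicon-based scoring of this kind amounts to counting how many review words fall into each sentiment category. The sketch below uses a hypothetical four-word toy lexicon, not the actual emotion lexicon, and the function name is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> sentiment categories. The real lexicon is far larger.
TOY_LEXICON = {
    "pain": {"negative"},
    "worse": {"negative"},
    "relief": {"positive"},
    "better": {"positive"},
}


def sentiment_counts(reviews, lexicon=TOY_LEXICON):
    """Count lexicon hits per sentiment category across all reviews."""
    totals = Counter()
    for review in reviews:
        for token in review.lower().split():
            word = token.strip(".,!?")  # trim surrounding punctuation
            for category in lexicon.get(word, ()):
                totals[category] += 1
    return totals


counts = sentiment_counts(["Pain got worse.", "Some relief, but pain returned"])
```

In this toy example, words that merely describe the condition ("pain") count as negative even in a review that reports relief, which illustrates the kind of limitation general-purpose lexicons have on medical text.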
We also explored the most frequent terms in our term-document matrix; summary results are shown in Figure 2 below.

Figure 2. Most frequent terms in the corpus
As shown in Figure 2, the most frequent words, such as 'day', 'take', and 'month', will be penalized using the TF-IDF transformation before training our machine learning models to avoid overfitting, since they are less informative for classifying new incoming data.
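The penalizing effect of TF-IDF on very frequent terms can be seen in a small sketch. The exact weighting variant used in the study is not stated, so the choice below (relative term frequency within a document multiplied by log(N / document frequency)) is an assumption, as are the function name and the toy counts.

```python
import math


def tf_idf(tdm):
    """Weight a term-document matrix (rows = terms, columns = documents, raw counts)."""
    n_docs = len(tdm[0])
    doc_totals = [sum(row[j] for row in tdm) for j in range(n_docs)]
    weighted = []
    for row in tdm:
        doc_freq = sum(1 for count in row if count > 0)  # documents containing the term
        idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
        weighted.append(
            [(count / doc_totals[j]) * idf if doc_totals[j] else 0.0
             for j, count in enumerate(row)]
        )
    return weighted


# Row 0: a term such as 'day' that appears in every review.
# Row 1: a rarer term that appears in only one review.
toy_tdm = [[2, 1, 3], [1, 0, 0]]
weights = tf_idf(toy_tdm)
```

The term that occurs in every review gets an IDF of log(1) = 0 and is weighted down to zero, while the rarer term keeps a positive weight in the review where it appears.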
We also visualize our term-document matrix (TDM) by constructing a word cloud. As shown in Figure 3 below, the most frequent terms appear much larger than the less frequent terms. As discussed above, the large words in the cloud are the most frequent words, which must be penalized to improve model accuracy.

Figure 3. Word cloud for the drug reviews
After completing all necessary manipulations of our text documents as explained in section 3, seven different machine learning models are trained to predict overall drug performance from users' reviews. The final data frame consists of 302 features, of which 300 are the most important predictors obtained through LSA and two are engineered from review length and cosine similarity, respectively. The feature engineered from review length did not add any value and hence was discarded.
Models are trained using 10-fold cross-validation with a stratified sampling approach to preserve the balance in the levels of our target variable. Our binary target variable, which represents overall drug performance, is modeled using 301 predictors. Models trained include CART, C5.0, logistic regression (GLM), MARS, SVM with radial and linear kernels, and random forest. The tables below provide summary results for our model-fitting parameters. According to Table 1.0, the random forest model has the highest accuracy (84%), followed by SVM with radial kernel (83%), logistic regression (GLM) (83%), SVM with linear kernel (83%), MARS (82%), C5.0 (81%), and CART (80%), which appeared to be the weakest model in predicting overall drug performance. We therefore performed a Bonferroni test to analyze the significant differences among the fitted models. The table below provides summary results of the test. Overall, results from the machine learning models show the benefit of applying unstructured data (user reviews) to predict overall drug performance. Although we utilized only a sample of the dataset due to scalability issues, through Latent Semantic Analysis (LSA) and the TF-IDF transformation we were able to train machine learning models with reasonable accuracies.
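Stratified 10-fold cross-validation assigns observations to folds separately within each class, so that every fold keeps roughly the same high/low balance as the full sample. The following is a minimal standard-library sketch of that assignment step only; the function name, the seed, and the 70/30 toy labels are assumptions for illustration and do not reflect the class balance in the study.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Return a fold index (0..k-1) per observation, balanced within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize order within the class
        for position, index in enumerate(indices):
            folds[index] = position % k  # deal observations round-robin into folds
    return folds


labels = ["high"] * 70 + ["low"] * 30
folds = stratified_folds(labels, k=10)
```

Each model would then be fitted ten times, each time holding out one fold for validation and training on the other nine, and the ten accuracies averaged.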
Although the random forest achieved the best accuracy (84%) in predicting new drugs as either low-rated or highly rated based on the reviews provided by users, the SVM with the linear kernel, which attained a maximum accuracy of 83%, was selected due to its simplicity and computational efficiency.

CONCLUSION
From the above discussion, supervised machine learning models provide an effective means of predicting overall drug performance from unstructured textual data instead of relying completely on sentiment scores. Using only a small portion of the dataset, we managed to attain reasonable accuracy in our models by applying the TF-IDF transformation to penalize the most frequent terms and the Latent Semantic Analysis (LSA) technique to select a few powerful predictive features. Further, the random forest classification model appeared to be superior to all other models considered in this study, with an accuracy of 84%, yet the SVM model with linear kernel was selected due to its simplicity and computational efficiency. Finally, we propose a similar future study to compare various feature selection techniques, such as Latent Semantic Analysis (LSA), Principal Component Analysis (PCA), Partial Least Squares (PLS), the Chi-Square method, the Information Gain Ratio technique, and other methods found in the literature, for analyzing texts from the medical domain using a supervised machine learning approach.