Drug Solubility Prediction: A Comparative Analysis of GNN, MLP, and Traditional Machine Learning Algorithms

This study evaluates drug solubility prediction models.


INTRODUCTION (GİRİŞ)
Drug discovery and development are central activities in the pharmaceutical industry and medical research. Within this process, accurate prediction of the solubility of drug molecules is of pivotal importance. Solubility is a fundamental parameter in drug design and formulation: it determines how well a compound dissolves in water or another solvent, which in turn affects the pharmacokinetics, bioavailability, and toxicity of the drug. Accurately predicting the solubility of a drug candidate during development is therefore essential for the early detection and resolution of potential issues [1,2].
Over the years, drug solubility prediction has moved from conventional methodologies toward advanced techniques rooted in artificial intelligence (AI) and machine learning (ML). These technological advances have provided access to vast amounts of molecular data, enabling new approaches to solubility prediction. Traditional methods attempt to predict drug solubility using mathematical equations based on the chemical and physical properties of the compound, whereas AI and ML offer a more flexible, data-driven approach capable of recognizing complex molecular interactions and patterns [2,3].
In this context, Graph Neural Networks (GNNs) have attracted considerable attention. GNNs handle graph-structured datasets successfully across diverse domains and learning paradigms, including supervised, semi-supervised, self-supervised, and unsupervised settings. The majority of graph-based methodologies fall within unsupervised learning, frequently relying on autoencoders, contrastive learning, or concepts related to random walks.
GNNs differ from traditional neural networks in that they are designed to operate on graph structures rather than on sequences. The use of graphs has grown significantly in recent years, primarily because graphs can effectively represent complex real-world problems characterized by interconnections. These applications include structured data where information is used in unstructured formats such as test cases. Graphs have also proven valuable for modeling social networks, molecular structures, web link data, and more, enabling extensive analysis and interpretation. In image analysis, GNNs have proven effective in tasks such as image segmentation and object detection by representing images as graphs. In short, GNNs provide a specialized approach to processing graph-structured data, improving performance in domains where interconnectedness and structured relationships are paramount.
GNNs are applied across a diverse spectrum of tasks, including but not limited to network embedding, graph classification, node classification, spatial-temporal graph forecasting, and graph generation. This adaptability makes GNNs important tools for tackling intricate problems characterized by relational structures and dependencies. For molecules, GNNs use a graph-based approach: a molecular structure is represented as a graph, and solubility is predicted by analyzing that graph. GNNs can thereby contribute to a better understanding of molecular interactions and accelerate drug design [4,5].
However, traditional regression models and conventional ML algorithms are still considered effective tools in this field. Traditional machine learning methodologies, particularly exemplified by the Multilayer Perceptron (MLP), have shown their efficacy in solubility prediction and are widely adopted within the pharmaceutical industry [6]. Additionally, prominent ML algorithms such as XGBoost, Gradient Boosting (GB), and Random Forest (RF) have demonstrated robust predictive capabilities, particularly on extensive molecular datasets [7].
This study aims to compare AI-based models with traditional ML methods in drug solubility prediction. Specifically, we evaluate the performance of GNNs and MLP in predicting drug solubility, and we also examine the role of popular ML algorithms such as RF, GB, and XGBoost in this context. This research could indicate which model performs best in drug design and development processes and contribute to future drug discovery studies.

Literature Review (Literatür Taraması)
The prediction of drug solubility is widely recognized as a longstanding challenge in the pharmaceutical industry, leading to extensive research in this field. Traditional methods use various mathematical equations and rules to predict drug solubility by considering factors such as chemical groups in the molecular structure, bond lengths, and atomic charges. However, these methods often achieve limited success in predicting the solubility of individual drug molecules.
In recent years, computational modeling has advanced substantially, with particular emphasis on predicting the aqueous solubility of diverse substances. Due to the limitations of traditional methods, AI and machine learning techniques have become more appealing for drug solubility prediction. These techniques offer greater flexibility in analyzing large datasets and identifying complex patterns in molecular interactions. Specifically, AI methods such as deep learning and GNNs have been employed to better represent molecular structures and improve solubility predictions [8]. Various prediction techniques, including Multiple Linear Regression (MLR), Principal Component Analysis (PCA), Partial Least Squares (PLS), Artificial Neural Network (ANN), K-Nearest Neighbors (k-NN), Support Vector Machine (SVM), and RF, are used to forecast the properties of molecules. These predictions rely on descriptors that capture the characteristics of chemical structures [9][10][11].
GNNs represent molecular structures using a graph-based approach, modeling atoms as nodes and chemical bonds as edges. GNNs play an effective role in a range of applications, from predicting drug-protein binding values to analyzing drug similarities and predicting drug side effects without extracting drug scaffolds. GNNs have emerged as a significant tool for accelerating the drug discovery process, reducing costs, and discovering new drug candidates by allowing a more detailed analysis of molecular interactions [12]. In contrast, traditional regression models like MLP represent molecular features as vectors and make predictions using these feature vectors [13].
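As a minimal illustration of the atoms-as-nodes, bonds-as-edges representation described above, the sketch below encodes ethanol by hand in plain Python. The molecule, the three-element vocabulary, and the helper names `adjacency_list` and `one_hot` are assumptions made for illustration and are not taken from the study; in practice such graphs are usually derived from SMILES strings with a cheminformatics toolkit.

```python
# Ethanol (CH3-CH2-OH), heavy atoms only; hydrogens are left implicit.
atoms = ["C", "C", "O"]      # node i carries an element symbol
bonds = [(0, 1), (1, 2)]     # undirected edges: the C-C and C-O single bonds


def adjacency_list(n_atoms, bonds):
    """Return, for each atom index, the sorted list of bonded neighbours."""
    neighbours = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    return {i: sorted(ns) for i, ns in neighbours.items()}


def one_hot(symbol, vocabulary=("C", "N", "O")):
    """Encode an element symbol as a one-hot node feature vector."""
    return [1.0 if symbol == v else 0.0 for v in vocabulary]


graph = adjacency_list(len(atoms), bonds)
node_features = [one_hot(symbol) for symbol in atoms]
```

A vector-based model such as an MLP would instead receive one fixed-length descriptor vector per molecule, with no explicit record of which atom is bonded to which.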
The potential of GNNs in drug solubility prediction has been explored in several recent studies. In one investigation, diverse deep learning models were developed for solubility prediction. Four distinct GNN models were proposed to capture molecular representation, and among them the AttentionFP model showed notably superior performance [14]. A Multi-Ordered Graphical Attention Network (MoGAT) was also introduced as an advanced framework for solubility prediction, with the primary objectives of enhancing prediction performance and making the predicted results easier to explain. MoGAT was reported to outperform contemporary methods, and its predictions were compatible with established chemical knowledge [15]. In a different investigation, a novel graph framework was proposed to predict the water solubility of pharmaceutical compounds. This framework introduced a distinct GNN model named ALIGNN, explicitly tailored for the QM9 dataset [16].
The collective findings from these studies suggest that GNNs have the potential to enhance solubility predictions by offering a more nuanced understanding of molecular interactions. Nevertheless, traditional regression models such as MLP continue to be acknowledged as effective tools in this domain and are widely used by pharmaceutical companies. For example, one research project employed multiple machine learning algorithms (MLR, ANN, RF, ET, and SVM) to predict drug solubility in different solvents [17]. In a separate investigation, Kernel Ridge Regression (KRR), Least Angle Regression (LAR), and MLP were applied to forecast the solubility of Lenalidomide, a drug used in the treatment of specific bone marrow-related conditions in adults [18]. Another study focused on developing an AI-based model using an SVM to examine the solubility data of the drug Busulfan [19]. Additionally, three distinct machine learning approaches, namely k-NN, MLP, and KRR, were employed to predict the solubility of the drug Nystatin [20].
Popular machine learning algorithms such as RF, GB, and XGBoost have also been employed for drug solubility prediction, yielding successful results. These algorithms are considered valuable tools for making predictions on large molecular datasets [21].

2. MATERIALS AND METHODS (MATERYAL VE METOD)
This research undertakes a comparative analysis of diverse ML and AI methodologies for the prediction of drug solubility. To this end, a drug dataset is first assembled, and various modeling strategies are then evaluated on it.
The drug dataset includes the chemical properties, structures, and solubility values of various drug molecules. It is obtained from a database widely used in drug design and development processes and encompasses a variety of drug molecules.
Various methodologies are employed in the prediction of drug molecule solubility, encompassing models such as GNNs, MLP, RF, GB, and XGBoost. These models employ different mathematical and graphical approaches to represent the molecular features and structures of drugs.
The study begins with the preparation and preprocessing of the drug dataset. The various models are then trained and used for solubility prediction. The results are used to compare the performance of the models and to determine which model is the most effective for drug solubility prediction. Figure 1 shows the flowchart of the proposed method.
This study uses the "ESOL (Estimated Aqueous Solubility)" dataset, which is employed for predicting the solubility of drugs. The ESOL dataset was created for measuring and estimating the aqueous solubility of various chemical compounds. It finds application in drug design, chemical analysis, pharmacokinetic investigations, and molecular modeling studies [22,23].
The ESOL dataset comprises 1,128 chemical compounds, each with its solubility provided as a logarithmically transformed value, expressed as log solubility in moles per litre. Additionally, it includes independent variables representing the chemical structure and molecular properties of each compound. These molecular properties constitute the fundamental data used for solubility predictions [22,23]. This dataset is employed to contribute to the understanding of critical pharmaceutical parameters such as the bioavailability, pharmacokinetics, and toxicity of drugs. Furthermore, it is extensively investigated to understand how AI-based regression models and deep learning techniques can be used to expedite drug design processes.
This dataset is regarded as an important resource within the pharmaceutical industry, chemical research, and molecular modeling. The objective of this study is to predict solubility using the ESOL dataset, with the expectation that these predictions will provide useful insights for the design and development of pharmaceutical compounds.
The ESOL dataset consists of 10 columns and 1,128 rows.
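Because the solubility values are stored on a logarithmic scale, it can help to see how a logged value maps back to a concentration. The sketch below assumes a base-10 logarithm of solubility in mol/L, which is how the commonly distributed ESOL file labels its target column; the function name `log_s_to_concentration` is hypothetical.

```python
def log_s_to_concentration(log_s):
    """Convert log10 solubility (mol/L) to mol/L and to micromol/L.

    Assumes a base-10 logarithm of a concentration given in mol/L.
    """
    mol_per_litre = 10.0 ** log_s
    micromol_per_litre = mol_per_litre * 1e6
    return mol_per_litre, micromol_per_litre


# A compound with log solubility -2 dissolves to about 0.01 mol/L.
mol_l, umol_l = log_s_to_concentration(-2.0)
```

Working on the log scale keeps compounds whose solubilities differ by several orders of magnitude within a comparable numeric range for regression.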

Graph Neural Networks (Grafik Sinir Ağları)
GNNs are an artificial intelligence approach that represents the molecular structures of chemical compounds as graphs and uses deep learning techniques to analyze these graph structures. The molecular graph representation of a chemical compound encompasses the bonds between atoms and molecular properties. GNNs make solubility predictions by processing this graph structure [24].
GNNs offer several features that support a better understanding of the molecular properties and chemical structures of compounds. In particular, GNNs can make solubility predictions by considering the local structure of the molecular graph, which is highly valuable for evaluating the interactions and bonds between different atoms. They can also take into account the topology and structural characteristics of chemical compounds.
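How a GNN uses local graph structure can be sketched with a single, weight-free message-passing step followed by a readout. This is a simplified illustration, not the architecture used in the study: real GNN layers apply learned transformations, and the function names, mean aggregation, sum pooling, and the two-element feature encoding below are all assumptions.

```python
def message_passing_step(features, neighbours):
    """Replace each node's vector by the mean over itself and its neighbours."""
    updated = []
    for i, own in enumerate(features):
        group = [own] + [features[j] for j in neighbours[i]]
        updated.append([sum(column) / len(group) for column in zip(*group)])
    return updated


def readout(features):
    """Sum-pool all node vectors into one fixed-size molecule vector."""
    return [sum(column) for column in zip(*features)]


def predict(features, neighbours, weights, bias, steps=2):
    """Run a few message-passing steps, pool, then apply a linear regressor."""
    for _ in range(steps):
        features = message_passing_step(features, neighbours)
    pooled = readout(features)
    return sum(w * x for w, x in zip(weights, pooled)) + bias


# Ethanol heavy atoms C-C-O with one-hot features over (C, O).
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neighbours = {0: [1], 1: [0, 2], 2: [1]}
after_one_step = message_passing_step(features, neighbours)
```

After one step the oxygen's vector already mixes in information from its carbon neighbour, which is the sense in which the prediction depends on local structure; each further step widens that neighbourhood by one bond.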
Within this study, the role of GNNs is to strengthen the regression model used to predict the solubility of chemical compounds in the ESOL dataset. This model generates solubility predictions from the molecular graph representations of chemical compounds.
These predictions can contribute to the comprehension and improvement of critical parameters in drug design processes, such as the bioavailability and pharmacokinetics of drugs.
GNNs are included in this study to support the development of new solubility prediction methodologies in pharmaceuticals and chemistry and thereby to expedite and enhance drug design. A GNN-based solubility prediction model can serve as a valuable tool in drug development processes and assist in the more effective and safe design of drugs. Figure 2 shows the working structure of a GNN.

Multilayer Perceptron (MLP) (Çok Katmanlı Algılayıcı)
Within the scope of this study, the MLP is employed as a deep learning model derived from artificial neural networks. MLP is a widely used regression and classification model in the fields of data mining and pattern recognition. It is known for its capability to learn complex functions and is particularly successful in prediction problems involving structured and unstructured data types.
The multi-layered architecture is a distinctive feature of MLP. An MLP comprises at least three layers: an input layer, one or more hidden layers, and an output layer. Neurons within these layers are fully interconnected, each connecting to every neuron in both the preceding and succeeding layers. This structure increases the model's capacity to represent and learn intricate functions [6].
Every neuron processes its incoming data by applying an activation function. Frequently used activation functions include sigmoid, Rectified Linear Unit (ReLU), and tanh. The choice of activation function plays a crucial role in determining the outputs of neurons and enables the model to learn non-linear relationships.
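The layer structure and activation functions described above can be sketched as a forward pass in plain Python. The layer sizes, weights, and function names here are illustrative assumptions, not the configuration used in the study.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def identity(x):
    return x


def dense(inputs, weights, biases, activation):
    """Fully connected layer: every output neuron sees every input."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one ReLU hidden layer -> single linear output (regression)."""
    hidden = dense(x, hidden_w, hidden_b, relu)
    return dense(hidden, out_w, out_b, identity)[0]


# Two input features, two hidden neurons, one output.
prediction = mlp_forward(
    [1.0, 2.0],
    hidden_w=[[1.0, -1.0], [0.5, 0.5]],
    hidden_b=[0.0, 0.0],
    out_w=[[2.0, 1.0]],
    out_b=[0.5],
)
```

The first hidden neuron receives a negative pre-activation and is clipped to zero by ReLU; without such a non-linearity the two layers would collapse into a single linear map. `math.tanh` or `sigmoid` could be passed to `dense` in place of `relu`.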
An MLP is typically trained using a process called backpropagation. The model's predictions are compared to the actual values, and the errors arising from this comparison are propagated backward through the layers. This allows the weights to be updated and the model to be trained. Figure 3 illustrates the structure of an MLP.
This study also covers a diverse array of ML algorithms used in the prediction of drug molecule solubility. These algorithms come from the fields of data mining and pattern recognition and are applied here to solubility prediction, a critical task in drug development and drug design.
Here is a brief description of the fundamental ML algorithms utilized in this study:

a. Random Forest (RF):
RF is an ensemble learning algorithm in which multiple decision trees are constructed and combined. Each decision tree evaluates data samples independently, and the results of the trees are then aggregated. This approach enhances the overall performance of the model while reducing the risk of overfitting [25].

b. Gradient Boosting (GB):
GB is an ensemble learning technique in which simple models, referred to as weak learners, are combined to create a powerful model. This method employs an iterative process where each new model attempts to correct the errors of the previous ones. GB can create a robust regression model capable of achieving high accuracy [26].

c. XGBoost (Extreme Gradient Boosting):
XGBoost is a variation of GB known for its high performance and scalability. It is frequently preferred for structured (tabular) datasets and for regression tasks such as solubility prediction [27].
These ML algorithms are integral to this study's objective of predicting drug solubility, aiding in the advancement of drug design processes, and contributing to pharmaceutical research.
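The difference between the two ensemble ideas above, averaging independently built trees versus adding trees that correct earlier errors, can be sketched with one-split regression trees (stumps) on a single feature. This is a simplified illustration under stated assumptions: a real Random Forest also subsamples features and grows deeper trees, XGBoost adds regularization and other refinements, and all names and default values below are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on one feature by least squares."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # all x values identical: fall back to the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def fit_bagging(xs, ys, n_trees=25, seed=0):
    """Random-Forest-style ensemble: average stumps fit on bootstrap samples."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits current residuals."""
    base = sum(ys) / len(ys)
    residuals = [y - base for y in ys]
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
baseline = mse(lambda x: sum(ys) / len(ys), xs, ys)
bagged = fit_bagging(xs, ys)
boosted = fit_boosting(xs, ys)
```

In the bagged ensemble the stumps never see each other's output, whereas in the boosted ensemble every stump is trained on what the previous ones left unexplained.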

PERFORMANCE METRICS AND THE RESEARCH FINDINGS (Performans Metrikleri ve Araştırma Bulguları)
Performance metrics are criteria used to assess how close a model's predictions are to the actual values.
The metrics utilized in this study include:
1. Root Mean Square Error (RMSE): RMSE is a measure of how much a model's predictions deviate from the actual values. To calculate RMSE, the difference between each prediction and the actual value is computed, the squares of these differences are averaged, and the square root of this average is taken. RMSE has the same units as the predicted variable, making it directly interpretable. The mathematical formula is shown in Equation 1.
2. Mean Absolute Error (MAE): MAE is the average of the absolute differences between the predictions and the actual values. The mathematical formula is shown in Equation 2.
3. R-squared (R²): R² measures how well a model explains the variance in the dataset. R² typically lies between 0 and 1 and signifies the goodness of fit of the model to the data, where higher values denote a better fit. As R² approaches 1, the model explains the data well. R² can also be negative, which indicates that the model fits the data worse than simply predicting the mean. The mathematical formula is shown in Equation 3.
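For reference, the standard definitions of these metrics can be written as follows. The notation is an assumption made here: y_i is the measured solubility of compound i, ŷ_i the model's prediction, ȳ the mean of the measured values, and n the number of compounds evaluated.

```latex
\begin{align}
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}} \tag{1}\\
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{2}\\
R^{2} &= 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}} \tag{3}
\end{align}
```

The numerator in Equation 3 is the model's squared error and the denominator is the squared error of always predicting ȳ, so R² becomes negative exactly when the model does worse than that constant baseline.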
These metrics are commonly used in evaluating the performance of regression models. RMSE and MAE quantify the size of the prediction errors, while R² indicates how well the model fits the dataset. Together they provide the basis for assessing a model's effectiveness and for comparing results across models.
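A minimal sketch of how the three metrics can be computed from paired lists of measured and predicted values is given below. The function name and the toy numbers are illustrative assumptions and are not results from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) for paired measured and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot  # undefined if all measured values are equal
    return rmse, mae, r2


# Toy values: one of three predictions is off by 1.
rmse, mae, r2 = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])

# A constant, badly placed prediction gives a negative R^2.
_, _, r2_bad = regression_metrics([1.0, 2.0, 3.0], [3.0, 3.0, 3.0])
```

In the first toy case RMSE exceeds MAE because squaring weights the single large error more heavily, which is why the two are usually reported together.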
The analyses were performed in a Python environment on a computer with 32 GB of RAM and an RTX 3060 GPU. The model parameters were chosen as epochs = 50, optimizer = Adam, and learning rate = 0.01. Plots of the real and predicted solubility values are essential tools for evaluating the performance of the different models used in drug solubility prediction.
This study was conducted to evaluate the performance of different ML algorithms in predicting drug solubility. Solubility prediction is critically important in drug development processes and pharmaceutical design, so accurate and reliable solubility prediction models play a significant role in the success of the pharmaceutical industry.
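To make the reported training settings concrete (50 epochs, Adam, learning rate 0.01), the sketch below applies the Adam update rule to a one-feature linear model with mean squared error. The model, the toy data, and the Adam constants `beta1`, `beta2`, and `eps` (set to commonly used defaults) are assumptions for illustration; the study's networks and data are not reproduced here.

```python
import math


def train_linear_adam(xs, ys, epochs=50, lr=0.01,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """Fit y = w*x + b by full-batch Adam on mean squared error."""
    params = [0.0, 0.0]   # [w, b]
    m = [0.0, 0.0]        # running mean of gradients
    v = [0.0, 0.0]        # running mean of squared gradients
    losses = []           # loss measured before each update
    n = len(xs)
    for t in range(1, epochs + 1):
        w, b = params
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errors) / n)
        grads = [
            2.0 * sum(e * x for e, x in zip(errors, xs)) / n,  # dL/dw
            2.0 * sum(errors) / n,                             # dL/db
        ]
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)   # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params, losses


# Toy data generated from y = 2x + 1.
params, losses = train_linear_adam([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

Because Adam normalizes each step by the gradient's running magnitude, a parameter moves by roughly the learning rate per update, so with these settings the weights can travel only about 0.5 in 50 epochs. The epoch count and learning rate therefore jointly limit how far a model can move from its initialization.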
At the outset of the study, the ESOL dataset was employed. This dataset encompasses experimentally measured solubility values of various drug molecules. In the preprocessing and preparation phase, molecular graph features were extracted, and the data was partitioned into training and test sets.
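The partitioning step can be sketched as a shuffled split in plain Python. The split ratio and random seed are not stated in the text, so the 80/20 ratio, the seed, and the function name below are assumptions.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of the items and split off a held-out test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Indices standing in for the 1,128 rows of the dataset.
train_idx, test_idx = train_test_split(range(1128))
```

Fixing the seed keeps the same compounds in the test set for every model, so that differences in RMSE, MAE, and R² reflect the models rather than the split.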
In the first stage of the study, the GNN model was used. GNN is an AI-based approach used for drug solubility prediction. However, when the performance of the GNN model was compared to the other ML models in this study, it was found to be inferior. Performance metrics such as RMSE, MAE, and R² showed that the GNN model's predictions contained larger errors than those of the other models. These results indicate that the GNN model was not effective for solubility prediction on this specific dataset.
In the second stage, the MLP model and traditional ML algorithms (RF, GB, XGBoost) were employed. These models are commonly used ML approaches for regression tasks. The results show that the traditional ML algorithms achieved higher success in solubility prediction. In particular, RF, GB, and XGBoost achieved lower RMSE and MAE values and higher R² scores, indicating more accurate predictions. These findings suggest that traditional ML algorithms may be more effective than AI-based approaches (such as GNN) in regression tasks like drug solubility prediction. However, these results can vary depending on factors such as the dataset, feature engineering, hyperparameter tuning, and model selection. Further research that considers different model structures and data characteristics may therefore be necessary for drug solubility prediction.
In conclusion, this study provides a comparative analysis of ML and artificial intelligence models used in drug solubility prediction. It emphasizes that traditional ML algorithms may perform better for a specific dataset and should be considered in model selection. Accurate solubility predictions can offer a significant advantage in the design and discovery of new drugs in pharmaceutical development processes.

Figure A: Schematic representation of the proposed methodology as a flow diagram. / Şekil A: Önerilen metodolojinin şematik gösterimini bir akış diyagramı aracılığıyla göstermektedir.

Figure 4. Loss values during training (Eğitim sırasındaki kayıp değerleri)
Figure 4 illustrates the loss values during training. After 100 iterations, the training loss reached 0.082. A trained GNN model can be used to predict the solubility of new molecules.

Figure 5 visually represents how well each model predicts the actual solubility and the extent of deviation between their predictions and the actual values.

DISCUSSION AND CONCLUSION (Tartışma ve Sonuç)

Table 1 below shows the relevant attribute columns for 19 drugs for this data.

Table 2 demonstrates that the RF model had the lowest error rates compared to the other models. This indicates that the RF model predicted solubility more accurately and provided results closer to the actual values. The GNN model, on the other hand, exhibited higher error rates than the other models, suggesting that it made less accurate predictions for this specific dataset and needs improvement. This comparison makes clear which model excelled in drug solubility prediction and which has more room for improvement, and it can guide the selection of an appropriate model in drug design and development processes.