Kidney X-ray Images Classification using Machine Learning and Deep Learning Methods
Işıl Karabey Aksakallı, Sibel Kaçdıoğlu and Yusuf Sinan Hanay

Today, kidney stone detection is performed manually by humans on medical images. This process is time-consuming and subjective, as it depends on the physician. This study aims to classify individuals as healthy or patient according to the presence of kidney stones in medical images using various machine learning methods and a Convolutional Neural Network (CNN). We evaluated Decision Trees (DT), Random Forest (RF), Support Vector Machines (SVM), Multilayer Perceptron (MLP), K-Nearest Neighbor (kNN), Naive Bayes (BernoulliNB), and a deep neural network based on CNN. According to the experiments, the Decision Tree Classifier (DT) gives the best classification result, achieving the highest F1 score of 85.3% with the S+U sampling method. The experimental results show that DT is a feasible method for distinguishing kidney x-ray images.

race. It is more common in men than in women. Kidney stones are thought to be caused by factors such as lack of physical activity and eating habits. Chronic conditions such as high blood pressure, diabetes, and obesity may affect stone formation. After a kidney stone is treated, it may recur and become chronic.
Prevention of kidney stone formation and recurrence is still a significant problem for human health. Impairment of kidney function due to the formation of kidney stones endangers human life. Therefore, early diagnosis of kidney stones is critical. In recent years, machine learning and deep learning approaches have been widely adopted to diagnose diseases thanks to the development of technology. These methods support diagnostic decisions that otherwise require long and complex processes, as they shorten the diagnosis time and increase the diagnostic accuracy. Along with deep learning, computer vision can extract the properties of an image and classify images by making predictions based on the model it creates. To date, many studies have diagnosed diseases with deep learning methods. In a previous study, deep learning methods were used to detect and classify brain tumors from medical images [2]. In another study, deep learning was used to recognize the pathological characteristics of diabetic patients [3]. In addition, deep learning methods were used to diagnose thyroid nodules from ultrasound images [4].

II. RELATED WORK
In medicine, diseases are diagnosed with the experience and knowledge of doctors. The use of an automatic diagnosis system can facilitate the work of doctors. Several studies [2,3,4,5] use deep learning methods to help diagnose and classify diseases, for example eye diseases caused by diabetes. In the United States, the prevalence of diabetic retinopathy is approximately 28.5% among individuals that have diabetes, while this rate is 18% in India [5]. Most physicians refer their diabetic patients to the ophthalmologist at regular intervals for retinopathy or macular edema screening, depending on the severity of the disease. Automatic grading of diabetic retinopathy has the benefits of increasing efficiency, reproducibility, and the scope of screening programs. It can improve patient outcomes by reducing access barriers and providing early diagnosis and treatment. A type of Convolutional Neural Network (CNN) named Inception-v3 is generally used to aid image analysis and object detection. In the EyePACS-1 data set, the sensitivity of the algorithm was 97.5%, and the specificity was 93.4%. In the Messidor-2 data set, the sensitivity was 96.1%, and the specificity was 93.9% [3]. In the literature, CNN is also used for brain tumor detection. In [2], a deep learning-based brain tumor detection and classification system has been proposed using skull MR images. In that study, the ELM-LRF (local receptive field extreme learning machine) method was proposed for tumor classification. In the experiments, the classification accuracy on MR images was 97.18%. The performance of the proposed method is better than that of recent studies conducted with commonly used methods such as CNN.
In another study, a transfer learning method using the Inception-v3 model, which was previously adapted to medical image analysis, was proposed to classify nodules in the thyroid glands from ultrasound images [4]. Today, the fine needle aspiration (FNA) method is used when evaluating nodules. The model classified 20 of the 21 FNA-malignant glands as malignant, giving 95.2% sensitivity, and 21 of the 34 FNA-benign glands as benign, giving 61.8% specificity. In addition, in the external test set (100 gland images: 50 benign, 50 malignant), it classified 47 of the 50 FNA-malignant glands as malignant, giving 94% sensitivity, and 28 of the 50 FNA-benign glands as benign, giving 56% specificity [4]. Computer-based nodule detection and classification can help doctors avoid unnecessary FNA.
There are many studies on the diagnosis and classification of kidney diseases by machine learning methods. In one study, a synthetic kidney function test (KFT) data set including age, gender, urea, creatinine, and glomerular filtration rate was created for the analysis of kidney disease [6]. The study compares Support Vector Machine (SVM) and Artificial Neural Network (ANN) in terms of accuracy and running time when predicting four types of kidney disease from the information of kidney patients. The classification accuracies were 76.32% for SVM and 87.70% for ANN. In another study, kidney stones are detected from low-contrast ultrasound images. Median filtering, Gaussian filtering, and unsharp masking are applied to enhance the images. Subsequently, kNN and SVM classification techniques were used for the analysis of kidney stone images. The accuracy of the kNN classifier was 89%, and the accuracy of the SVM classifier was 84% [7]. In a similar study, classification of kidney disease (stone or tumor) and segmentation were performed on ultrasound images [8]. Artificial neural networks are proposed for classification and a multi-kernel k-means algorithm for segmentation. A median filter was used to remove noise in the ultrasound images, and GLCM (Gray Level Co-occurrence Matrix) features were then extracted from each denoised image. The proposed system with linear + quadratic-based segmentation reached a maximum accuracy of 99.61%, the highest among all compared methods. The type of the stones is also important for treatment. To determine it, the type of kidney stone was classified from endoscopic video images with a deep learning network trained with digital photographs of five types of kidney stone components.
This classification aims to automatically determine the laser energy settings, which are otherwise adjusted manually according to the kidney stone component and size [9]. The authors of that study use a convolutional neural network to classify the kidney stone type. In addition, kidney stones differ from each other in position, shape, and size, which makes kidney stone segmentation with machine learning challenging. Preprocessing approaches have been studied in the literature to reduce this difficulty. In one study, a preprocessing algorithm was developed for kidney stone detection and segmentation from CT images [10]. Three thresholding algorithms based on density, size, and location were applied to remove unrelated organ and bone structures from the images. CT images of 30 patients were studied, and a sensitivity of 95.24% was obtained with the proposed algorithm [10]. In another study [11], the effects of morphological operations on kidney stone classification and analysis were investigated. The location and size of the kidney stone were estimated using GAC segmentation in addition to extraction and morphological operations. The proposed algorithms were applied to several kidney images, and high efficiency was reported [11]. In a related study, SVM was used for classification in automatic kidney stone detection. Before classification, histogram equalization and an embossing method, which evaluates color differences directionally, were applied to the images. The proposed method was tested on 156 CT images of kidneys with stones and healthy kidneys, achieving 98.71% accuracy [12]. In another study, a thresholding-based model was developed with deep learning for the detection and scoring of kidney stones from abdominal non-contrast computed tomography (NCCT) images. The model is divided into four stages. Initially, 3D U-Nets were created for kidney and kidney sinus segmentation.
Later, deep 3D dual-path networks were developed for hydronephrosis grading. Thresholding methods were used to identify and segment stones in the renal sinus area. Finally, the location of the stone was determined. As a result, the stone detection method reached 95.9% sensitivity and 98.7% positive predictive value (PPV) [13].

III. OVERVIEW OF THE PROPOSED METHOD
Kidney x-ray images are used to detect whether a person has kidney stones. Based on this information, a person is classified as healthy or patient. This decision is generally made by a specialist doctor. Samples of healthy and patient images are shown in Fig. 2. Although a specialist doctor can distinguish the images given in the figure, some images are difficult even for a specialist to interpret, and manual detection takes time. Therefore, an algorithmic detection system is needed for the classification of the x-ray images. In this study, a decision support mechanism that determines whether an individual is patient or healthy is proposed by applying various machine learning and deep learning methods to kidney x-ray images. The block diagram of the proposed mechanism is shown in Fig. 1. In the first step of Fig. 1, each image was scaled to 64 x 80 pixels because the kidney x-ray images were obtained with different sizes. In the second step, the fixed-size images were converted into grayscale images. In the third step, gray-level values were extracted from these grayscale images and saved in a CSV file with their labels. Since the number of samples with a healthy label in the data set is quite low compared to the patient label, resampling methods were applied to the dataset to deal with the imbalanced classes. After all these steps, various classification methods, including machine learning and deep learning, were applied to the balanced dataset, and the test dataset was evaluated in terms of precision, recall, and F1 score.

A. Dataset
The dataset is prepared by using 221 kidney x-ray images obtained from the Urology Department of Ataturk University.
Before classification, these images are subjected to several preprocessing steps. In the first step, the different-sized x-ray images are converted into 64x80 fixed-size images. In the second step, the resized images are converted to grayscale. In the third step, the gray-level values obtained from each image are written to a CSV file. Each image yields 5120 columns of values (64 x 80) and is labeled as patient or healthy according to the presence of kidney stones, catheters, or both in the x-ray image. Images without any kidney stones or catheters are labeled as healthy, while images with stones or catheters are labeled as patient. The labeling was done by taking into account the opinions of the specialist doctors working in the Urology department. In the resulting dataset, 182 images have a patient label, while 39 images have a healthy label.
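The three preprocessing steps above can be sketched as follows. This is a minimal illustration that uses only the Python standard library, not the authors' implementation: the nearest-neighbour resize, the luma weights used for grayscale conversion, the reading of 64 x 80 as 64 rows by 80 columns, and the synthetic input image are all assumptions made for the example.

```python
import csv
import io

# Assumption: "64 x 80" is read as 64 rows by 80 columns (5120 values per image).
TARGET_ROWS, TARGET_COLS = 64, 80


def resize_nearest(image, rows=TARGET_ROWS, cols=TARGET_COLS):
    """Step 1: nearest-neighbour resize of a 2D list of pixels (a simple stand-in)."""
    src_rows, src_cols = len(image), len(image[0])
    return [
        [image[r * src_rows // rows][c * src_cols // cols] for c in range(cols)]
        for r in range(rows)
    ]


def to_gray(pixel):
    """Step 2: convert one (R, G, B) pixel to a gray level using common luma weights."""
    r, g, b = pixel
    return int(round(0.299 * r + 0.587 * g + 0.114 * b))


def image_to_row(rgb_image, label):
    """Steps 1-3: resize, convert to grayscale, flatten, and append the label."""
    resized = resize_nearest(rgb_image)
    gray = [[to_gray(p) for p in row] for row in resized]
    flat = [value for row in gray for value in row]
    return flat + [label]


# Hypothetical 100 x 120 synthetic RGB image standing in for an x-ray.
fake_image = [
    [(x % 256, y % 256, (x + y) % 256) for x in range(120)] for y in range(100)
]
row = image_to_row(fake_image, "patient")

# Write the row as one CSV line (an in-memory buffer is used instead of a file).
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
```

Each image becomes one CSV row of 5120 gray-level values followed by its label, which matches the column count stated above.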

B. Method
In this study, a decision support system based on machine learning and deep learning that detects whether a kidney x-ray image belongs to a patient or a healthy individual is proposed. Machine learning (ML) is a branch of artificial intelligence that offers powerful classification techniques: a model is trained on existing data and then makes predictions on test data, which makes it possible to analyze data too large for the human mind alone [14]. Machine learning methods are widely used in data classification, pattern recognition, and prediction, with applications such as email filtering, face detection, disease prediction, fraud detection, and traffic management. Deep learning (DL) is a type of machine learning in which the learning process takes place on an artificial neural network model with more intermediate layers.
Deep learning methods can classify large amounts of data with higher accuracy to provide analytical results based on the parameters and objectives of a particular framework [15]. Deep learning is mainly used in image segmentation, disease prediction, and recommendation systems, with models such as convolutional neural networks, autoencoders, and restricted Boltzmann machines [16].
Within the scope of the study, the machine learning methods Decision Tree (DT) [17], Random Forest (RF) [18], Support Vector Machine (SVM) [19], Multilayer Perceptron (MLP) [20], k Nearest Neighbor (kNN) [21], and Naive Bayes (BernoulliNB) [22], and the deep learning method Convolutional Neural Network (CNN), which is a feedforward neural network [23], have been applied. In the model training phase, the StratifiedKFold cross-validation method is applied to split the dataset. In addition, the grid-search method is used to determine, for each classification method, the parameters that give the highest accuracy rate.
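The combination of stratified k-fold splitting and grid search can be illustrated with a small standard-library sketch. It is not the authors' code: the helper names, the toy one-dimensional features, the candidate grid, and the tiny kNN classifier used as the model being tuned are all hypothetical. Only the class counts (182 patient, 39 healthy) and the five folds come from the text.

```python
import random
from collections import Counter, defaultdict


def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs that keep class proportions in every fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)  # deal each class round-robin
    for i in range(n_splits):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, val


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training points (1-D toy features)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cv_accuracy(xs, ys, k, n_splits=5):
    """Fraction of samples predicted correctly when each fold is held out once."""
    correct = 0
    for train, val in stratified_kfold(ys, n_splits):
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        correct += sum(knn_predict(tx, ty, xs[i], k) == ys[i] for i in val)
    return correct / len(ys)


# Toy data with the same class counts as the dataset (182 patient, 39 healthy).
rng = random.Random(42)
xs = [rng.gauss(1.0, 1.0) for _ in range(182)] + [rng.gauss(-1.0, 1.0) for _ in range(39)]
ys = ["patient"] * 182 + ["healthy"] * 39

# Grid search: evaluate every candidate and keep the best cross-validated score.
grid = [1, 3, 5, 9, 16]
scores = {k: cv_accuracy(xs, ys, k) for k in grid}
best_k = max(scores, key=scores.get)
```

Stratification matters here because only 39 of the 221 images are healthy; a plain random split could leave a fold with very few healthy samples.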

1) Resampling process
Since the dataset used in this study contains fewer samples with the healthy label than with the patient label, resampling is performed on the dataset. Among the resampling methods, undersampling, oversampling, and SMOTETomek, which combines the two, are applied in this study. Undersampling deletes a randomly selected subset of the samples belonging to the dominant class. In addition to making the data more balanced, this method can shorten the running time of the classification method, since it works with less data, which is especially useful when the data size is very large. On the other hand, as the information in the deleted samples is lost, it can lead to underfitting. Oversampling increases the number of samples in the minority class by repeating randomly selected samples from that class. Since no information is lost, it may be superior to undersampling. However, since some of the samples are repeated exactly, it can lead to overfitting.
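Random undersampling and random oversampling, as described above, can be sketched in a few lines. This is a minimal standard-library illustration with hypothetical helper names and placeholder feature rows, not the authors' pipeline. It does not cover SMOTETomek, which generates synthetic minority samples rather than repeating existing ones.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly drop samples until every class matches the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    keep = []
    for cls in counts:
        idx = [i for i, y in enumerate(labels) if y == cls]
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_oversample(rows, labels, seed=0):
    """Randomly repeat samples until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        idx = [i for i, y in enumerate(labels) if y == cls]
        for i in rng.choices(idx, k=target - n):  # exact repeats of existing rows
            out_rows.append(rows[i])
            out_labels.append(cls)
    return out_rows, out_labels


# Class counts taken from the dataset description; feature rows are placeholders.
labels = ["patient"] * 182 + ["healthy"] * 39
rows = [[i] for i in range(len(labels))]

under_rows, under_y = random_undersample(rows, labels)
over_rows, over_y = random_oversample(rows, labels)
```

With these counts, undersampling leaves 39 samples per class (78 in total) and oversampling yields 182 per class (364 in total), which makes the trade-off between information loss and exact repetition concrete.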

2) Classification
After resampling the data set, a train-test split is performed, with 80% used as training data and 20% as test data. The model is trained on the training data using StratifiedKFold cross-validation (kFold = 5). Then, the test data, which are not used in model training, are evaluated in terms of the Precision, Recall, and F1 score metrics. Performance percentages according to the resampling methods and classification algorithms used are given in Tables II and III.

TABLE I
THE BEST PARAMETER VALUES FOR THE DATA SET ACCORDING TO MACHINE LEARNING METHODS
Algorithm | Best parameter values

A general description of all classification methods applied within the scope of this study is given below:

Decision Tree (DT): Decision Tree is a supervised learning method generally used for classification and regression analysis [24]. DT is expressed as a flowchart-like structure. Each internal node represents a test on an attribute, each branch describes a test result, and each leaf represents a class label [17,25,26].

Random Forest (RF): Random Forest, which consists of a combination of many decision trees, is used mostly for classification problems as well as regression. RF is a collection of tree-structured classifiers {h(x, Θ_k), k = 1, …}, where the Θ_k are independent, identically distributed random vectors. Given classifiers h_1(x), h_2(x), …, with the training set drawn from the distribution of the random vector (X, Y), the margin function is calculated as

$$mg(X, Y) = \mathrm{av}_k\, I(h_k(X) = Y) - \max_{j \neq Y} \mathrm{av}_k\, I(h_k(X) = j)$$

where I(·) is the indicator function. The margin measures how far the average number of votes at (X, Y) for the correct class exceeds the average vote for any other class. The larger the margin, the more confidence in the classification [25].

Support Vector Machine (SVM): Support Vector Machine is a supervised machine learning method proposed by Vapnik [19]. SVM classifies samples by dividing the training dataset into distinct classes with a hyperplane. When the dataset consists of two-dimensional data, the hyperplane of a linear classifier is a line [19,27]. In this study, we use the linear classifier to distinguish the healthy and patient-labeled data. In the mathematical expression, x is a data point (vector), and w is a weight vector. This classifier aims to find the optimal hyperplane for which the distance between the two classes is greatest, that is, the one that maximizes the margin. This is called the maximum margin linear classifier, and the calculation of the margin is expressed in Fig. 3 and equation (1).

Multilayer Perceptron (MLP): Multilayer Perceptron is a popular feed-forward neural network due to its fast operation, easy applicability, and small dataset requirements [26]. The units of this network are organized into an input layer, one or more hidden layers, and an output layer [20]. The input layer takes an activation vector externally and transmits it to the units in the first hidden layer via weighted links. After each layer calculates its activation, it transfers the activation to the neurons in the successive layers, as shown in Fig. 4. Each neuron i in the network is a simple processing unit that calculates its activation s_i based on the incoming excitation, called the net input net_i, which is calculated as in equation (2):

$$\mathrm{net}_i = \sum_{j \in \mathrm{pred}(i)} s_j w_{ij} - \theta_i \qquad (2)$$

In this equation, pred(i) indicates the set of predecessors of unit i, w_ij represents the weight of the connection from unit j to unit i, and θ_i is the bias value of unit i. The activation of unit i is calculated by passing the net input through a nonlinear activation function. Usually the logistic sigmoid function is used:

$$s_i = f_{\log}(\mathrm{net}_i) = \frac{1}{1 + e^{-\mathrm{net}_i}} \qquad (3)$$

The easily computable derivative of this function makes the method advantageous. The derivative is shown below:

$$\frac{\partial s_i}{\partial \mathrm{net}_i} = f'_{\log}(\mathrm{net}_i) = s_i (1 - s_i) \qquad (4)$$

k Nearest Neighbor (kNN): kNN is a supervised learning method that classifies a test sample based on the k training samples closest to it in the feature space [27,28]. After storing all available samples, the method classifies new samples according to a similarity measure. In the experiments performed with the Grid Search method, the optimal k value was found to be 16.

Naive Bayes (BernoulliNB): The Naive Bayes method calculates the probability that a sample from the test dataset has a given label by multiplying the probabilities of all factors affecting that result [29]. In the equation below, C is the class label and the F values are the input data:

$$P(C \mid F_1, \ldots, F_n) = \frac{P(C)\, P(F_1, \ldots, F_n \mid C)}{P(F_1, \ldots, F_n)} \qquad (5)$$

This operation gives the probability that input data of unknown class belongs to class C. After the probabilities are calculated, the class label with the highest probability is assigned as the class of the test data [29].

Convolutional Neural Network (CNN): CNN is one of the best-known and most successful methods and has been widely used in image processing in recent years. This method is based on artificial neural networks, whose network structure contains more intermediate layers and neurons with learnable weights and biases.
The input of a CNN can be any image or numerical data, and the model differs according to the problem in terms of the dropout value, the number of layers, the activation function used, and the number of neurons [30]. Within the scope of this study, the CNN model is created using the Keras library and the Python programming language. The learning rate, optimization algorithm, number of hidden layers, number of epochs, initial weight values, and activation functions were varied as hyperparameters to obtain the best result. Fig. 5 shows the CNN model developed to classify whether the person is healthy or patient.
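The unit activation described for the Multilayer Perceptron above, which also applies to the neurons with learnable weights and biases in a CNN, can be checked numerically. The sketch below is a minimal illustration: the input activations, weights, and bias are hypothetical values, and subtracting the bias in the net input is an assumed sign convention. It compares the closed-form derivative s(1 - s) of the logistic sigmoid with a finite-difference estimate.

```python
import math


def net_input(pred_activations, weights, bias):
    """Net input of one unit: weighted sum of predecessor activations minus the bias.

    Subtracting the bias is an assumed sign convention for this sketch.
    """
    return sum(s * w for s, w in zip(pred_activations, weights)) - bias


def sigmoid(net):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-net))


def sigmoid_derivative(net):
    """Closed-form derivative of the logistic sigmoid: s * (1 - s)."""
    s = sigmoid(net)
    return s * (1.0 - s)


# Hypothetical predecessor activations, connection weights, and bias for one unit.
s_prev = [0.2, 0.7, 0.5]
w_i = [0.4, -0.6, 0.9]
theta_i = 0.1

net_i = net_input(s_prev, w_i, theta_i)
s_i = sigmoid(net_i)

# Compare the closed-form derivative with a central finite difference.
eps = 1e-6
numeric = (sigmoid(net_i + eps) - sigmoid(net_i - eps)) / (2 * eps)
analytic = sigmoid_derivative(net_i)
```

The two derivative values agree to within numerical precision, which is why the logistic sigmoid is convenient for gradient-based training: the derivative is obtained from the activation itself without further evaluation of the exponential.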

IV. EXPERIMENTAL RESULTS
After the resampling process is completed, 80% of the dataset (182 image values) is used for training, and the precision, recall, and F1 score performance metrics are evaluated with the StratifiedKFold cross-validation method. The classification performance of the methods depends on the number of correctly detected patients (TP, True Positive), the number of healthy people identified as patients (FP, False Positive), and the number of patients identified as healthy (FN, False Negative). The Precision and Recall values are calculated from these counts, and the F1 score is the harmonic mean of precision and recall. Therefore, a high F1 score is an essential criterion for a suitable decision support mechanism. The general formulas of the metrics used in the study are given in equations (6) - (8):

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (6)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (7)$$

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (8)$$

Table II shows the cross-validation performance of the trained models according to algorithms and sampling methods. The highest average precision (85.6%) and recall (88.9%) are obtained with the combined SMOTE + RandomUnderSampler (S+U) method, while the highest average F1 score (89.4%) is obtained with the SMOTE sampling method. According to the cross-validation scores of the applied algorithms, MLP and CNN achieve the highest values in terms of the evaluation metrics. MLP has an 87.2% precision rate and a 92.4% F1 score rate, while CNN has a 100% recall rate.

Fig. 6 shows the confusion matrices obtained by classifying the test data using S+U resampling. When the confusion matrices in Fig. 6 are examined, it is seen that the BernoulliNB and CNN models classify 100% of the patient data correctly, so the recall obtained from these models is 100%. However, these methods could not correctly classify the healthy-labeled data.
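The relationship between these metrics can be made concrete with a small worked example. The counts below are hypothetical and are not taken from the paper's confusion matrices. They mimic a model that labels every test image as patient, which is the situation described for BernoulliNB and CNN: recall reaches 100% while every healthy image is counted as a false positive.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); returns 0.0 when nothing is predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 36 patient images all predicted "patient" (TP),
# 8 healthy images also predicted "patient" (FP), and no missed patients (FN).
tp, fp, fn = 36, 8, 0

p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)
```

In this example recall is 1.0 although no healthy image is recognized, and the F1 score is still high because patient is the positive class. This is why the confusion matrices need to be examined alongside the scores.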

V. CONCLUSION
In this study, kidney x-ray images obtained from Atatürk University Research Hospital are used to classify patients and healthy individuals with machine learning and deep learning approaches. Using these methods, a decision support mechanism is proposed that shortens diagnosis time and helps with images that the specialist doctor has difficulty diagnosing. First, the images are scaled to a fixed size and converted to grayscale. Then, a data set is created by extracting gray-level numerical values from the images. Since this data set has imbalanced classes, various oversampling and undersampling methods are used, which increases the performance metrics of the methods significantly. Accurate detection of healthy individuals is as important as the detection of patients in the detection of kidney diseases. In this respect, achieving a high F1 score is one of the most important criteria. According to the experiments, DT has the highest F1 score, 85.3%, using the S+U sampling method.