Enhancement Of Breast Cancer Diagnosis Accuracy With Deep Learning

Breast cancer is a frequently fatal disease that is highly prevalent among women. This study proposes an approach that uses deep learning, a machine learning technique, to improve the accuracy of breast cancer diagnosis. The method uses the Breast Cancer Wisconsin (Original) data set from the University of California, Irvine (UCI) Machine Learning Repository. The data set contains 699 records with 10 independent variables and 1 dependent variable. The 16 records with missing values were corrected by imputation so that the entire data set could be used. The data were normalized to reduce training time and then split into 80% for training, 10% for validation, and 10% for testing. The deep learning model is an artificial neural network with 5 layers: an input layer with 10 neurons, 3 hidden layers with 1000 neurons each, and an output layer with 3 neurons. The software was written in Python using Spyder, an interactive development environment, together with the Keras neural network API. Model performance was evaluated with a confusion matrix and ROC (Receiver Operating Characteristic) analysis. On the test data, the trained model produced successful results. The proposed method is expected to contribute to improving the accuracy of breast cancer diagnosis.


Introduction
Breast cancer is a potentially fatal disease and is considered the second most common cancer type among women worldwide. According to data from World Cancer Research Fund International, 2 million new cases were observed in 2018, and this number is gradually increasing [1]. In Turkey, twenty-five thousand patients are diagnosed with breast cancer each year. This rate continues to increase in developing countries. For this reason, accurate and early diagnosis is more important than ever. At the same time, machine learning is increasingly used in medicine for the diagnosis of diseases, where it assists physicians in reaching an early and accurate diagnosis. This study proposes an approach that uses deep learning, a machine learning technique, to improve the accuracy of breast cancer diagnosis.
Artificial intelligence enables machines to operate by imitating the human capabilities of learning and decision making. Machine learning (ML) is a branch of artificial intelligence that allows software to estimate results more accurately without being explicitly programmed for the task. To design a model with conventional machine learning techniques, a feature vector must first be extracted. These techniques also require expert supervision, and raw data cannot be used directly. With deep learning, however, expert supervision is not needed and raw data can be used directly [2,3]. Deep learning is one of many methods within machine learning and is used for fast learning from large and complex data. It is widely applied in areas such as computer vision, speech and sound processing, natural language processing, robotics, bioinformatics, computer games, search engines, energy generation, the automotive industry, aviation, manufacturing, online advertising, and finance [4]. Deep learning is known to give highly successful results in estimation and classification tasks.
The literature contains many studies on breast cancer diagnosis, including studies that use the Wisconsin Breast Cancer data set. Baneriee [8]. In [9], feature selection and feature extraction techniques were combined using deep learning to predict the outcome of breast cancer.
In this study, a method is proposed to improve the accuracy of breast cancer diagnosis by means of deep learning. The method uses the Breast Cancer Wisconsin (Original) data set, which has been used in the literature on breast cancer diagnosis. A classification is carried out so that the method outputs benign or malignant according to the values of the patient groups in the data set. Confusion matrix metrics and ROC (Receiver Operating Characteristic) analysis are used to evaluate model performance.
Section 2 describes the experimental studies and the methods used. Section 3 presents the test results obtained from the experimental studies, and Section 4 gives the evaluation of the study.

Material and Method
This section describes the steps of the deep learning method designed to improve the accuracy of breast cancer diagnosis.
The flow chart of the designed model is shown in Figure 1.

Breast Cancer Wisconsin Original Data Set
The proposed method uses the Breast Cancer Wisconsin (Original) data set, which is available in the University of California, Irvine (UCI) Machine Learning Repository and has been used in the literature [10]. The data set contains only numerical values. It holds a total of 699 records, consisting of 458 benign and 241 malignant patient groups, with 10 independent variables and 1 dependent variable. Sixteen of these records have missing feature values. The feature names and values in the data set are provided in Table 1. A snapshot of the data set can be seen in Figure 2.

Imputation of Dataset
Training a model on a data set that contains missing values may substantially degrade the quality of the deep learning model. For this reason, the 16 records with missing values were corrected with a statistical missing value technique called mean imputation, so that the entire data set could be used in training. This method calculates the mean of the available values in a column and substitutes it for the missing values, treating each column independently [11].
The "Sample Code Number" feature in the data set has no effect on the class variable. To reduce the dimensionality of the data set and to avoid including trivial features, this column was excluded from the data set.
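The column-wise procedure can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the function name, the toy rows, and the use of "?" as the missing-value marker are assumptions made for the example.

```python
from statistics import mean

def mean_impute(rows, missing="?"):
    """Replace missing entries in each column with the mean of that column."""
    imputed = [list(row) for row in rows]
    for col in range(len(rows[0])):
        # Mean of the values that are actually present in this column.
        observed = [float(r[col]) for r in rows if r[col] != missing]
        col_mean = mean(observed)
        for r in imputed:
            r[col] = col_mean if r[col] == missing else float(r[col])
    return imputed

# Toy data (hypothetical): three records, three features, two gaps.
rows = [
    ["5", "1", "2"],
    ["3", "?", "4"],
    ["4", "3", "?"],
]
filled = mean_impute(rows)
```

Each column is handled independently, so a gap in one feature never borrows information from another feature.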

Normalization of Dataset
Normalizing the raw data increases data efficiency. The data set was normalized to the range 0-1 in order to reduce the long learning period caused by the size of the data set. The MinMaxScaler method was used for this process, as shown in Equation 1 [12].

$$z = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{1}$$

Here, z is the normalized value, x is the input value, min(x) is the smallest value in the input set, and max(x) is the largest value in the input set.
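The min-max normalization described here can be illustrated with a short plain-Python sketch. The function name and the sample column are hypothetical, and the handling of a constant column is an added assumption that the text does not discuss.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1] with min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

column = [1, 4, 10, 7]          # hypothetical feature values
scaled = min_max_scale(column)  # smallest value maps to 0.0, largest to 1.0
```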

Splitting the Dataset
The data used in deep learning are divided into three subsets: training, validation, and testing. How the available data are allocated among these three subsets is vital for an objective measure of success. As a result of various tests, the data set in the proposed model was allocated as 80% (559 records) for training, 10% (70 records) for validation, and 10% (70 records) for testing. The allocation is shown in Figure 3. The cross validation method was used in implementing this process [13].
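The 80/10/10 allocation can be reproduced with a simple shuffled split. The sketch below only shows how the subset sizes follow from 699 records; it is a plain holdout split with a hypothetical function name and random seed, and it does not reproduce the cross validation procedure cited above.

```python
import random

def split_indices(n, val_frac=0.10, test_frac=0.10, seed=0):
    """Shuffle record indices and cut them into train/validation/test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = round(n * val_frac)
    n_test = round(n * test_frac)
    n_train = n - n_val - n_test  # the remainder goes to training
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(699)
print(len(train_idx), len(val_idx), len(test_idx))  # 559 70 70
```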

Software
The software developed for the study was written in the Python programming language using Spyder, an interactive development environment capable of advanced editing, interactive testing, debugging, and introspection. In addition, the Keras neural network API was used for deep learning in the developed method. Keras is a high-level neural network API that supports Python. It allows ideas to be turned into results rapidly, is highly modular and minimalist, and is extensible. Keras supports CNNs (Convolutional Neural Networks), RNNs (Recurrent Neural Networks), and combinations of the two [14].

Neural Network Model
The Keras Sequential model was used for the implementation. The parameters of the artificial neural network were determined through several tests, which led to a design with 5 layers: an input layer with 10 neurons, 3 hidden layers with 1000 neurons each, and an output layer with 3 neurons. The designed neural network is shown in Figure 4.

Figure 4. Designed neural network
The input layer has 10 neurons because the data set used for the implementation has 10 different inputs. The output value in the data set is either 2 or 4 (2 for benign, 4 for malignant). Because the sigmoid activation function is used at the output layer, the number 4 is represented as a 3-bit binary value. For this reason, the output layer of the network was designed with three neurons.
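The layer sizes and the 3-bit reasoning can be checked with a few lines of plain Python. This is a sketch under stated assumptions rather than the authors' code: the helper names are hypothetical, and the parameter count assumes fully connected layers in which the 10-neuron input layer only passes values forward and carries no weights of its own.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of fully connected layers between given sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def to_bits(value, width=3):
    """Write a non-negative integer as a fixed-width list of binary digits."""
    return [int(b) for b in format(value, f"0{width}b")]

layer_sizes = [10, 1000, 1000, 1000, 3]  # input, three hidden layers, output
total_params = dense_param_count(layer_sizes)  # 2016003 under the assumption above

benign_bits = to_bits(2)     # [0, 1, 0]
malignant_bits = to_bits(4)  # [1, 0, 0]
```

Under these assumptions the network has roughly two million trainable parameters for 699 records, which helps explain why the model relies on dropout, L2 regularization, and early stopping.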

Activation Functions of the Neural Network:
The activation functions used in the layers of created neural network are described as follows.

ReLU (Rectified Linear Unit) Activation Function
The ReLU (Rectified Linear Unit) activation function was used in the input layer and hidden layers of the neural network and is shown in Figure 5. ReLU has recently gained popularity in deep learning for its practicality. It enables the neural network to learn faster. The function sets negative values to zero [15]. Its mathematical expression is given in Equation 2.

$$f(x) = \max(0, x) \tag{2}$$

Sigmoid Activation Function
The sigmoid activation function was used at the output layer of the neural network. It takes values in the range (0, 1), as seen in Figure 6. Its mathematical expression is given in Equation 3.

$$f(x) = \frac{1}{1 + e^{-x}} \tag{3}$$
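Both activation functions are short enough to write out directly. The following plain-Python sketch is for illustration only; the two-branch form of the sigmoid is an implementation choice made here to avoid floating-point overflow for large negative inputs, not something stated in the text.

```python
import math

def relu(x):
    """ReLU: pass positive values through, set negative values to zero."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid, written so that math.exp never receives a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

hidden_value = relu(-3.0)   # 0.0
output_value = sigmoid(0.0)  # 0.5
```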

Dropout
Dropout is one of the methods used to prevent memorization (overfitting). In each iteration, it randomly removes neurons from a layer at a specified rate. The dropout process is illustrated in Figure 7, where the crossed units have been dropped out of the network. Regularization techniques can be combined with dropout to further improve performance by preventing overfitting. In the designed model, L2 regularization is used along with dropout. In this way, overfitting during the training of the network is minimized.
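The two regularization ideas can be illustrated with a small plain-Python sketch. The dropout rate and the L2 coefficient below are hypothetical, since the text does not report the values used. The sketch uses the "inverted" form of dropout, in which the surviving activations are scaled by 1 / (1 - rate) during training.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

def l2_penalty(weights, lam):
    """L2 regularization term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

rng = random.Random(0)                      # fixed seed, for repeatability only
dropped = dropout([1.0] * 1000, 0.5, rng)   # each entry is either 0.0 or 2.0
penalty = l2_penalty([1.0, -2.0], 0.01)     # 0.01 * (1 + 4)
```

The L2 term is added to the loss during training, so large weights are penalized while dropout prevents units from co-adapting.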

Optimization
The learning process in deep learning applications is essentially an optimization problem. Optimization techniques are used to find the optimum value when solving non-linear problems. Algorithms such as RMSprop, Adagrad, Adadelta, Adam, and Adamax are widely used in deep learning applications, and they differ from one another in performance and speed. In this study, the Adam (Adaptive Moment Estimation) optimization algorithm was used.

Adam Optimization
The Adam algorithm is simple to implement, learns quickly, is computationally efficient, requires little memory, and is well suited to problems with large data sets or many parameters [17].
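$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \tag{4}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{5}$$

$$\theta_{t+1} = \theta_t - \eta \frac{m_t}{\sqrt{v_t} + \epsilon} \tag{6}$$

Here, $g_t$ is the gradient at step $t$, $m_t$ is the exponential average of the gradient, $v_t$ is the exponential average of the squared gradient, $\beta_1$ and $\beta_2$ are the decay hyperparameters, $\eta$ is the learning rate, and $\theta_t$ is the parameter being updated. The exponential averages of the gradient and of the squared gradient are computed for each parameter (Eq. 4 and Eq. 5). To determine the learning step, the learning rate is multiplied by the exponential average of the gradient and divided by the square root of the exponential average of the squared gradients (Eq. 6). The update is then applied. In practice, the hyperparameters are commonly set to $\beta_1 = 0.9$ and $\beta_2 = 0.999$. Epsilon ($\epsilon$) is a very small number (commonly $10^{-8}$) that avoids division by zero.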
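The update described in Eq. 4 to Eq. 6 can be written as a single step function. The sketch below follows the description in the text and therefore omits the bias-correction terms that the published Adam algorithm also applies. The function name, the default values (the commonly used 0.9, 0.999, and 1e-8, with a learning rate of 0.001), and the toy objective are assumptions for illustration.

```python
import math

def adam_step(theta, grad, m, v, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update for a single scalar parameter (no bias correction)."""
    m = beta1 * m + (1.0 - beta1) * grad          # average of the gradient
    v = beta2 * v + (1.0 - beta2) * grad * grad   # average of the squared gradient
    theta = theta - lr * m / (math.sqrt(v) + eps)  # parameter update
    return theta, m, v

# Toy objective (hypothetical): f(theta) = theta**2, so the gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
first_theta, first_m, first_v = adam_step(theta, 2.0 * theta, m, v)

for _ in range(100):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v)
```

Because the step is divided by the root of the squared-gradient average, its size stays on the order of the learning rate regardless of the raw gradient scale.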

Loss Function
A loss function measures the error of a designed model and thus its performance. In deep learning, the loss function is defined at the last layer of the neural network. It calculates the dissimilarity between the model's estimate and the required real value. If a model with good estimation capability is designed, the difference between the real value and the estimated value, and therefore the loss value, will be low. A high loss value indicates that the designed model contains flaws, whereas a well-designed model is expected to have a loss value near zero. The literature offers various loss functions, such as mean squared error, mean absolute percentage error, mean squared logarithmic error, hinge, logcosh, sparse categorical cross entropy, binary cross entropy, Kullback-Leibler divergence, Poisson, and cosine proximity. In this study, the sparse categorical cross entropy loss function was used.

Sparse Categorical Cross-entropy Loss Function
This loss function computes the categorical cross-entropy between predictions and targets and is a common choice for multi-class classification problems. The mathematical equation is given in Equation 8 [19].

$$L_i = -\sum_{j} t_{i,j} \log(p_{i,j}) \tag{8}$$

Here, p refers to the predictions, t to the targets, i to the data point, and j to the class. Accuracy is used as the metric together with this loss function.
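In the sparse form of this loss, each target is given as a class index instead of a one-hot vector, so the sum over classes reduces to the negative log of the probability assigned to the true class. The plain-Python sketch below illustrates this; the function name, the clipping constant, and the example probabilities are hypothetical.

```python
import math

def sparse_categorical_crossentropy(probs, labels, eps=1e-12):
    """Mean of -log(p[true class]) over all data points.

    probs  : list of per-class probability lists, one per data point
    labels : list of integer class indices, one per data point
    """
    losses = [-math.log(max(p[y], eps)) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)

# Two data points, three classes (made-up predictions).
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
loss = sparse_categorical_crossentropy(probs, labels)
```

A confident, correct prediction contributes a loss close to zero, while a low probability on the true class is penalized heavily.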

Early Stopping
In models trained iteratively on data, training must be terminated at the right time. If training is not stopped, the system memorizes all of the samples in the training set, which reduces its ability to estimate unknown samples. The right time to terminate training depends on the choice of algorithm. If training is terminated too early, the performance of the system declines because it has not fully analyzed the data; over-training leads to the same outcome. To guard against overfitting, an early stopping parameter was defined in the program. This parameter was set to 10, meaning that if the neural network returns the same result 10 consecutive times, training is stopped regardless of the number of iterations.
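The stopping rule can be sketched as a small loop over per-epoch validation losses. The sketch interprets "the same result 10 consecutive times" as 10 consecutive epochs without improvement in the monitored value, which is how a patience setting usually behaves in Keras; that interpretation, the function name, and the example loss values are assumptions.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical history: the loss improves for 3 epochs, then stays flat.
history = [1.0, 0.8, 0.7] + [0.7] * 20
stop_at = early_stopping_epoch(history, patience=10)  # 13
```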

Evaluation of Model Performance
In deep learning, evaluating the model is essential for determining how well the system performs. The performance of the model in this study was evaluated with a confusion matrix and ROC analysis.

Confusion Matrix
In classification problems in machine learning, a confusion matrix is a special table layout that allows the performance of an algorithm to be evaluated. Each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class, or vice versa [20].
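For a binary problem, the four cells of the confusion matrix and the metrics derived from them can be computed as in the following plain-Python sketch. The labels use the data set's coding (2 for benign, 4 for malignant) with malignant as the positive class, but the example predictions are made up for illustration and are not the study's results.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def ratio(numerator, denominator):
    """Safe division: return 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, fp, fn, tn):
    """Derive the usual summary metrics from the four confusion-matrix cells."""
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "tpr": ratio(tp, tp + fn),   # sensitivity / recall
        "tnr": ratio(tn, tn + fp),   # specificity
        "ppv": ratio(tp, tp + fp),   # precision
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Hypothetical labels, not taken from the study.
y_true = [4, 4, 4, 2, 2, 2, 2, 4]
y_pred = [4, 4, 2, 2, 2, 4, 2, 4]
counts = confusion_counts(y_true, y_pred, positive=4)
metrics = classification_metrics(*counts)
```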

Receiver Operating Characteristic Curve (ROC)
The Receiver Operating Characteristic curve is an effective way to measure the performance of a classification model and uses the data obtained in the confusion matrix. ROC analysis investigates and employs the relationship between the sensitivity and specificity of a binary classifier [21]. The overall accuracy of the test increases as the ROC curve approaches the upper left corner [22]. Model performance is determined by the area under the ROC curve (AUC). For a useful classifier, the AUC lies between 0.5 (chance level) and 1, and the closer it is to 1, the better the model.
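The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Python sketch below uses that pairwise formulation; the function name and the example scores are hypothetical, and the quadratic loop is only meant for small illustrative inputs.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores : predicted scores for the positive class
    labels : 1 for positive examples, 0 for negative examples
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up scores: one positive example is ranked below one negative example.
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
auc = roc_auc(scores, labels)  # 5 of the 6 positive/negative pairs are ordered correctly
```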

Findings and Discussion
This section presents the results of the confusion matrix, accuracy (ACC), ROC curve, and AUC, which were used to evaluate the performance of the deep learning model designed to improve breast cancer diagnosis accuracy.
Based on the tests conducted, the number of epochs was set to 100. The confusion matrix of the 70 test records obtained after training is shown in Figure 9. In the confusion matrix, 41 actual malignant records were identified as malignant (TP), 1 actual benign record was identified as malignant (FP), 1 actual malignant record was identified as benign (FN), and 27 actual benign records were identified as benign (TN).

Figure 9. Confusion matrix
The values of Accuracy, AUC, F1 Score, True Positive Rate, True Negative Rate, Positive Predictive Value, and Negative Predictive Value obtained after training the deep learning model are presented in Table 2.

Table 2. Calculated Parameters
The Receiver Operating Characteristic (ROC) curve, obtained using the confusion matrix and the parameter values in Table 2, is shown in Figure 10. The AUC value was calculated as 0.9983.
To measure the performance of the proposed model during training and validation, the accuracy (ACC) was measured in each iteration; the resulting graph is shown in Figure 11.

Figure 11. Model accuracy
Figure 12 shows the loss function values measured in each iteration during both the training and the validation phases.

Conclusion
In this study, a new model was proposed with the aim of improving the accuracy of breast cancer diagnosis with deep learning. The assessment results of the model, which was designed using the Breast Cancer Wisconsin (Original) data set, were presented in Section 3. The AUC of the model was measured as 0.9983 and its accuracy as 0.9857. These results indicate that the proposed model performs well. It is thought that this newly developed method will contribute to improving the accuracy of breast cancer diagnosis, which is a crucial problem of our day.