A Comprehensive Evolution for Applicability of Machine Learning Algorithms on Various Domains

A version of this paper was presented at the 9th International Conference on Advanced Technologies (ICAT'20), 10-12 August 2020, Istanbul, Turkey, with the title “A Comprehensive Evolution for Applicability of Machine Learning Algorithms on Various Domains”.


Introduction
Machine learning is a type of artificial intelligence that can achieve results which are not explicitly programmed; it focuses on computer software that learns to grow and develop on its own when exposed to new data. In recent years, it has become one of the main pillars of information technologies and a part of our lives. With the increasing amount of data, intelligent data analysis is becoming more common as an essential component of technological progress [1,2].
With wide-ranging applications, machine learning is one of the most significant and fastest-spreading areas of computer science. Machine learning techniques are successfully used to solve various problems in image processing, pattern recognition, robotics, etc. [3]. Machine learning is the basis of numerous important applications such as self-driving cars, e-mail anti-spam, speech recognition, face recognition, web search, social networking, product recommendations and advertising [4].
Supervised and unsupervised learning are the two main classes of machine learning methods. In supervised learning, which requires an external teacher, each output is taught what the appropriate answer should be. In other words, supervised learning aims to learn an input-to-output mapping from suitable values fed by a supervisor. During the learning process, global information may be required. Supervised learning uses classification algorithms and regression methods. Neural networks, linear regression and logistic regression can be given as examples of supervised learning.
Unsupervised learning does not use such an external supervisor and is based only upon local information. It uses unknown and unlabeled training data, and the number of clusters is also unknown. The aim is to discover certain patterns in the input data to see how it usually behaves. The self-organizing map and k-means are kinds of unsupervised learning methods.
In this study, several supervised and unsupervised machine learning algorithms are investigated in both theoretical and practical aspects. For each algorithm, we first present its theoretical foundation in brief, then we apply it to an example domain to show its effectiveness. We also comment on the time consumption of the algorithms. Firstly, we study linear regression methods, including brute force, gradient descent and the normal equation, and their implementations on Covid-19 coronavirus data. Following this, we investigate logistic regression and its application, which is a feature classification example applied on segmented images. Moreover, we work on the self-organizing map and k-means algorithms and their implementations for color reduction.
The rest of the paper is organized as follows: linear regression methods with their Covid-19 coronavirus data applications are presented in Section 2. Logistic regression and its usage in image segmentation are given in Section 3. Section 4 includes the self-organizing map and k-means algorithms, which are implemented for color reduction. The paper ends with the conclusion in Section 5.

Linear Regression Methods
Linear regression techniques analyze the association between two variables. For each instance, the aim is to determine the hypothesis line fitting through the training set, where these variables are known [5]. When the relationship between a quantitative result and a single quantitative explanatory variable is examined, the most commonly considered analysis method is linear regression [6][7][8][9]. Linear regression is represented by Equation 1:

h(x) = Ɵ0 x0 + Ɵ1 x1 (1)
In this formula, x0 and x1 are the inputs multiplied by the coefficients Ɵ0 and Ɵ1, respectively. The input x0 is known as the bias input and always equals 1 [10]. So, the hypothesis function takes the form of Equation 2:

h(x) = Ɵ0 + Ɵ1 x (2)
The function of the parameters Ɵ0 and Ɵ1, called the cost function J(Ɵ0,Ɵ1), is given in Equation 3, where m is the number of training examples:

J(Ɵ0,Ɵ1) = (1/2m) Σ_{i=1..m} (h(x_i) − y_i)^2 (3)

The aim is to minimize the cost function.
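The cost function of Equation 3 can be sketched directly in code. The following Python snippet is illustrative only (the toy data is not the paper's dataset):

```python
# Mean-squared-error cost J(theta0, theta1) from Equation 3.
# xs, ys: training inputs and outputs; m: number of training examples.
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1, 2, 3]
ys = [2, 4, 6]             # perfectly linear toy data: y = 2x
print(cost(0, 2, xs, ys))  # → 0.0, since the true line has zero error
```

Any hypothesis other than the true line yields a strictly positive cost, which is what the minimization in the following sections exploits.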
The row that gives the minimum summation is determined, so the values corresponding to the same row in the Ɵ matrix are obtained. With the values of Ɵ0 and Ɵ1 in this row, the hypothesis function h(x) is constructed. In this section, various popular linear regression methods, namely brute force, gradient descent and the normal equation, are applied on the Covid-19 coronavirus dataset given in Table 1. This data, taken from [11], includes the number of people infected with coronavirus and the number of deaths from coronavirus in the United States within the first 20 days of April 2020.

Brute Force Method
One of the linear regression methods that guarantees an exact solution is brute force, since it tries the set of all possible candidates and can produce optimal solutions when small datasets are given. On the other hand, its execution time generally reaches an exponential order of growth for larger datasets. In most practical settings, brute force may be unacceptable for tackling problems, since the execution time may span many years [12].
When the brute force method is applied for the prediction of new infected patients and new deaths from Covid-19, the hypothesis function is iteratively calculated with different Ɵ value pairs (Ɵ0, Ɵ1). To change these Ɵ value couples incrementally, the range is intuitively defined as [-3000, 3000], and the Ɵ matrix is constructed over this grid. Then, for each i-th row containing a Ɵ parameter pair, the result matrix that contains the estimated y values is constructed by Equation 2. The difference between each y value and its estimated value is calculated, and the differences in the same row are summed. The hypothesis function is created with the coefficients Ɵ0 and Ɵ1 in the theta matrix row corresponding to the row in the result matrix giving the closest results to the real new-case values. The number of people who caught Covid-19 is calculated as 24500 with the theta values below.
h(20) = 34500 + (−500) × 20 = 24500

By the hypothesis function, the number of people who died from Covid-19 on 20th April is predicted as 3170. The result is not good enough because the step size of the Ɵ grid is taken too large and the brute force search overshoots the minimum. If the step size is defined smaller, a better result can be obtained, but the runtime of the algorithm becomes too long to be practical.
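The grid search described above can be sketched as follows. This is an illustrative Python snippet with a small grid and toy data, not the paper's [-3000, 3000] range or the Covid-19 dataset:

```python
# Brute-force linear regression: try every (theta0, theta1) pair on a
# coarse integer grid and keep the pair with the smallest summed error.
def brute_force(xs, ys, lo=-10, hi=10, step=1):
    best = None
    for t0 in range(lo, hi + 1, step):
        for t1 in range(lo, hi + 1, step):
            err = sum(abs(t0 + t1 * x - y) for x, y in zip(xs, ys))
            if best is None or err < best[0]:
                best = (err, t0, t1)
    return best[1], best[2]           # the winning (theta0, theta1)

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                     # toy data: y = 2x + 1
print(brute_force(xs, ys))            # → (1, 2)
```

The grid has (hi − lo + 1)^2 candidate pairs, which illustrates why shrinking the step size or widening the range quickly makes the runtime impractical.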

Gradient Descent
The gradient descent method is an optimization technique for determining a local minimum. This method is very popular among scientists because it is straightforward and terminates after executing relatively few steps. For the hypothesis function h(x), which is differentiable in a given limit, the direction of its fastest decline is the negative gradient. The gradient descent algorithm is applied to detect the local minimum [13].
In the gradient descent algorithm, the Ɵ values are calculated by Equation 4 for j equal to 0 and 1, separately:

Ɵj := Ɵj − α (∂/∂Ɵj) J(Ɵ0,Ɵ1) (4)

The algorithm determines the right Ɵ values by changing the Ɵ0, Ɵ1 values to reduce the cost function given in Equation 3 until ending up at a minimum cost. For each step of gradient descent, all the training examples are used. The hypothesis line is shown on the diagram of new cases in Figure 1. The difference between the actual value and the calculated value is not small enough but is acceptable, because the dataset size is not large enough for the system to learn well.
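Batch gradient descent for the two parameters can be sketched as below; the learning rate, iteration count and toy data are illustrative choices, not the paper's settings:

```python
# Batch gradient descent following Equation 4: each update to theta0 and
# theta1 uses the gradient of the cost averaged over all training examples.
def gradient_descent(xs, ys, alpha=0.05, iters=2000):
    m = len(xs)
    t0 = t1 = 0.0
    for _ in range(iters):
        grad0 = sum(t0 + t1 * x - y for x, y in zip(xs, ys)) / m
        grad1 = sum((t0 + t1 * x - y) * x for x, y in zip(xs, ys)) / m
        t0, t1 = t0 - alpha * grad0, t1 - alpha * grad1   # simultaneous update
    return t0, t1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                  # toy data: y = 2x + 1
t0, t1 = gradient_descent(xs, ys)
print(round(t0, 2), round(t1, 2))  # converges near 1.0 and 2.0
```

If alpha is chosen too large the iterates overshoot and diverge; too small and far more iterations are needed, which mirrors the runtime trade-off discussed above.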

Normal Equation
The normal equation method is an approach to fitting a mathematical model to data in cases where the idealized value provided by the model for any data point is expressed linearly in terms of the unknown parameters of the model [14].
The normal equation is the set of equations arising in the least squares method whose solutions give the constants that determine the shape of the estimated function [15]. This method solves the Ɵ matrix analytically by Equation 5 in matrix notation [16]:

Ɵ = (X^T X)^(-1) X^T y (5)

The hypothesis with these Ɵ values minimizes the cost function in Equation 3, or the squared error in other words [17].
Using the 19-day data given in Table 1, the Ɵ values are computed analytically. The results obtained by the normal equation are better than those of the gradient descent algorithm. When the iteration number is increased in the gradient descent algorithm, there is no significant difference and almost the same results are obtained as with the normal equation, but the runtime increases.
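For the single-feature case, Equation 5 reduces to a 2x2 system that can be solved by hand, so the closed form can be sketched without a linear-algebra library (toy data is illustrative):

```python
# Normal equation theta = (X^T X)^(-1) X^T y, expanded for one feature.
# The 2x2 matrix X^T X is inverted explicitly via its determinant.
def normal_equation(xs, ys):
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = m * sxx - sx * sx              # determinant of X^T X
    t0 = (sxx * sy - sx * sxy) / det
    t1 = (m * sxy - sx * sy) / det
    return t0, t1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]               # toy data: y = 2x + 1
print(normal_equation(xs, ys))  # → (1.0, 2.0)
```

Unlike gradient descent, no learning rate or iteration count is needed, but inverting X^T X becomes expensive when the feature count is large, matching the discussion in the conclusions.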

Logistic Regression
The logistic regression method is appropriate for predictive analysis when the dependent variable is binary. In this method, variables are categorized as win/loss, alive/dead, healthy/sick, success/failure and so on [18]; hence, it can be named binary classification. For example, logistic regression can be used to predict whether a tumor is malignant or not. It is also used to determine the relationship between one dependent binary variable and interval, ordinal or nominal independent variables. The method can be extended to model multiclass classification, such as determining whether an image includes a person, a flower, a dog, etc. Every object detected in the image is assigned a probability between 0 and 1 [19].
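A minimal one-feature sketch of binary logistic regression is given below; the sigmoid squashes the linear combination into a probability in (0, 1). The data, learning rate and update scheme are illustrative, not the paper's segmentation setup:

```python
import math

# Logistic regression with one feature, trained by simple per-example
# gradient updates. Labels are 0/1; predictions are probabilities.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, alpha=0.5, iters=5000):
    t0 = t1 = 0.0
    for _ in range(iters):
        for x, y in zip(xs, ys):
            p = sigmoid(t0 + t1 * x)   # predicted probability of class 1
            t0 += alpha * (y - p)
            t1 += alpha * (y - p) * x
    return t0, t1

xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]                # class flips around x ≈ 2
t0, t1 = train(xs, ys)
print(sigmoid(t0 + t1 * 1.0) < 0.5)    # → True: left side predicted class 0
print(sigmoid(t0 + t1 * 3.0) > 0.5)    # → True: right side predicted class 1
```

The multiclass case used for the seven segmentation classes can be built from this binary form, for example by training one such classifier per class (one-vs-rest).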
The Image Segmentation labeled dataset [20] from the UCI Machine Learning Repository is used as the training dataset for this part. Segmentation is subdividing an image into its component objects or regions [21]. The Image Segmentation instances were drawn randomly from a database of seven real outdoor images. The images were hand-segmented to create a classification for each pixel. Each pattern has 19 continuous attributes and corresponds to a 3x3 region of an outdoor image. There are 210 patterns in the training set and 2100 patterns in the test set, with 300 patterns from each class in the test data.
Classification includes assignment of data points to known classes by using supervised learning. Logistic regression algorithm is applied on the Image Segmentation dataset to classify the patterns into one of the seven classes which are predefined as brickface, sky, foliage, cement, window, path and grass.
During dataset processing, the third attribute, region-pixel-count (the number of pixels in a region), is not considered because it has the same value, 3x3=9, for every instance. There are 18 features after removing the third feature. The data is normalized to the range [-5, 5]. The logistic regression hypothesis is difficult to interpret with such a large number of features. To use fewer features, the image segmentation data is inspected by plotting the training data with three features. The classification result obtained by logistic regression with three features (attributes 11, 12 and 16) is shown in Figure 3, where each data point is assigned to one of the seven classes.
As seen from Figure 3, some data points from different classes overlap. To avoid this issue, classification can be done with two features. So, logistic regression is implemented on the dataset with two features, experimenting with every pair combination of the three attributes. As a result of the experiments, logistic regression is applied with attributes 11 and 12. The classified data points are shown in Figure 4.

Self-Organizing Map for Color Reduction
All algorithms mentioned so far fall into supervised learning, in which labeled training data is assumed to be available. In this section, some important machine learning algorithms that are trained using unsupervised learning are introduced, and a sample color reduction implementation in which they are used together is given.
Neural networks are used for applications where formal analysis would be difficult or impossible, such as pattern recognition and nonlinear system identification and control. The following section focuses on a particular type of neural network model known as the self-organizing map [22].

Self-Organizing Map
The self-organizing map was first proposed by Teuvo Kohonen in 1982 [23]. Artificial neural networks (ANNs), inspired by biological nervous systems, aim to model the information processing paradigm [22]. A self-organizing map (SOM), which is trained using an unsupervised learning approach, is a special type of ANN.
Given an input data pattern, a SOM has the capability to classify the data itself without any external supervision. The training process in a SOM is competitive. The basic components are neurons, which allow it to take meaningful decisions [24]. It is also a dimension reduction technique that uses a mapping or projection procedure from a large dimension to a smaller one. It helps in reducing the dimensions of data by grouping input patterns that are similar to one another [25].
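The competitive training process can be sketched with a tiny one-dimensional SOM: for each input, the best-matching unit (BMU) and its direct neighbours move toward the sample. The node count, learning-rate schedule, neighbourhood rule and data below are all illustrative choices:

```python
import random

# One-dimensional SOM: each node holds a scalar weight. The BMU gets a
# full update; its immediate neighbours get a half-strength update.
def train_som(data, n_nodes=4, lr0=0.5, epochs=100, seed=0):
    rng = random.Random(seed)
    weights = [rng.uniform(min(data), max(data)) for _ in range(n_nodes)]
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)       # learning rate decays to zero
        for x in data:
            bmu = min(range(n_nodes), key=lambda i: abs(weights[i] - x))
            for i in range(n_nodes):
                if abs(i - bmu) > 1:
                    continue                  # only BMU and direct neighbours move
                h = 1.0 if i == bmu else 0.5  # neighbourhood strength
                weights[i] += lr * h * (x - weights[i])
    return sorted(weights)

data = [0.1, 0.15, 0.2, 0.8, 0.85, 0.9]   # two clear clusters
w = train_som(data)
print(w[0] < 0.4 < 0.6 < w[-1])           # → True: extreme nodes track the clusters
```

In the color reduction application, each node's weight is an RGB triple instead of a scalar, and the trained weights become the reduced palette.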

K-Means
One of the most widely applicable clustering methods is k-means [26]. This algorithm is used frequently in many areas such as customer segmentation, market analysis, computer vision, etc. The k-means algorithm, applied with a convenient starting strategy, acts as a very efficient color reduction method [27].
After the number of clusters k is determined, k center points are selected randomly without replacement. For each input data point, the distance to each of these center points is calculated, and the point is assigned to the cluster of its closest center. Then the center points are recomputed for each cluster and clustering is performed according to the new centers. These steps continue until the system becomes stable.
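The steps above can be sketched for one-dimensional data as follows (the data values are illustrative):

```python
import random

# K-means: assign each point to its nearest centre, recompute centres as
# cluster means, and repeat until no centre moves.
def kmeans(data, k, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(data, k)          # k initial centres, no replacement
    while True:
        clusters = [[] for _ in range(k)]
        for x in data:
            i = min(range(k), key=lambda c: abs(centers[c] - x))
            clusters[i].append(x)
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:                 # stable: no centre moved
            return sorted(new)
        centers = new

print(kmeans([1, 2, 3, 10, 11, 12], 2))   # → [2.0, 11.0]
```

For color reduction the points are RGB triples and the distance is Euclidean in three dimensions, but the assign-recompute loop is identical.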

Color Reduction
Image color reduction is an essential technique used in many applications, such as compression, presentation, transmission, segmentation and analysis of color images [27][28][29]. For a particular data type, the RGB color cube, seen in Figure 5, is a 3-dimensional array that includes all the colors of the image. Three color cubes are defined for RGB images in MATLAB. For an RGB image of the uint8 class, there are 256 pixel values for each color plane. Hence, there are 2^24 colors in the color cube. Regardless of which colors are used in the image, the same color cube applies to all uint8 RGB images.

Figure 5. RGB color cube for uint8 (256 colors) image
Color reduction is a method in which the color cube is first partitioned into smaller cubes; then each color is mapped to the color value at the center of the small cube to which it belongs.
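The partition-and-map scheme can be sketched for uint8 pixels as below; the number of cubes per axis is an illustrative choice, not the paper's setting:

```python
# Uniform partition of the RGB cube: each 8-bit channel is quantized to
# the centre of the sub-cube it falls into.
def reduce_color(rgb, cubes_per_axis=4):
    size = 256 // cubes_per_axis      # side length of each small cube
    return tuple((c // size) * size + size // 2 for c in rgb)

print(reduce_color((200, 17, 90)))    # → (224, 32, 96)
```

With 4 cubes per axis, the 2^24-color cube collapses to 4^3 = 64 representative colors; adaptive methods such as SOM or k-means place the representatives according to the image content instead of on a fixed grid.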
The goal of color reduction is to minimize the disparity between the colors of the original image and the quantized colors; that is, to decrease inconsistencies between the actual and reduced colors. Figure 6 shows the color reduction of a uint8 image that is obtained by taking a slice of the RGB color cube.

Figure 6. A slice of the RGB color cube

SOM is one of the most effective color reduction methods and generally produces the desired results [29][30][31][32][33]. In the SOM algorithm, the initial weights can be assigned randomly, chosen from the data randomly, or defined by the k-means method. In this study, SOM is applied with initialization by the k-means algorithm. The SOM-based color reduction method is applied on the 64-color parrots image seen in Figure 7.a.
The algorithm generates color-reduced images with different numbers of colors, which are also the numbers of clusters. After color reduction, the output images consisting of 32, 16 and 8 colors are seen in Figure 7.b, c and d, respectively. The experimental results show that combining SOM with the k-means algorithm provides an effective color reduction in which the dominant colors of the input image are retained even when the number of colors is reduced to 8.

Discussion and Conclusions
In this article, we explain machine learning techniques using examples to make them as simple as possible. Brute force is a straightforward approach based directly on the problem statement and definitions. This approach guarantees finding the solution and prefers simpler hypotheses, but it is not effective for large data. Applying the normal equation as a linear regression method for computing Ɵ provides some advantages, such as no need to choose a learning rate and no need for many executions. This technique is not good for large data, because when the matrix size is large the algorithm works slowly. The gradient descent method requires selection of α and needs many iterations, but works well even if the data is large. Logistic regression is one of the simplest ways to implement binary classification and can be extended to multiclass classification. This method allows models to be updated easily to adapt to new incoming data, unlike support vector machines or decision trees.
The neural network approach, with its ability to capture complex relationships, is suitable for large training sets. Even if the training data has many features, the network can learn the system. The SOM technique, which is a type of neural network, provides easy data mapping. Because of its suitability for data compression, it is a preferable method for color reduction. The implementation of SOM-based color reduction in this paper produces acceptable results.
We clarify that the effectiveness of machine learning techniques depends on the domain to which they are applied. In other words, this study shows that selecting the appropriate machine learning method to tackle a given problem domain is of utmost importance. The machine learning techniques studied in this paper, their advantages and disadvantages, and possible domains in which each method gives better results are summarized in Table 2.
As future work, distributed machine learning techniques [34,35] that provide parallelization of the training process and scale to larger datasets will be considered. We also plan to study deep learning, one of the ways of executing machine learning. In this manner, we aim to use graph convolutional networks to solve NP-hard graph-theoretical problems such as vertex cover, dominating set and independent set, which are popular and have wide application areas.