Analysis of Classifier Performances Based on the Expectation-Maximization Algorithm Applied to a Gaussian Mixture Model

Parametric density estimation techniques, e.g., maximum likelihood, mixture models, Bayesian inference, and maximum entropy, are frequently used when the type of distribution is known or predictable. The Expectation-Maximization (EM) algorithm and variable-step learning algorithms are among the most successful ways of obtaining maximum-likelihood estimates of distribution parameters. In this paper, we present an implementation of the EM algorithm for a multidimensional Gaussian mixture model (GMM) composed of three different distributions. The statistical distributions are Gaussian, and the mean and covariance matrix of each distribution are used in the estimation process. The original feature vectors and their estimates are compared in terms of similarity, and the obtained results are presented and discussed in detail. In addition, each distribution of the bifurcated dataset is indicated. Finally, Bayesian, k-NN, and discriminant classifiers are applied to the GMM, and the performance of these methods is analyzed.


Introduction
Density estimation is commonly used in statistical theory. It forms a significant part of classification methods and pattern recognition problems. In general, density estimation techniques can be divided into two major groups: parametric and non-parametric estimation [1]. In non-parametric methods such as Parzen windows and k-NN, there is no information about the type of the distribution or its parameters. In this case, the probability density function must be estimated at the point of interest, and the classification made according to criteria that vary with the design of the classifier. In parametric methods such as maximum likelihood, Bayesian inference, maximum entropy, and mixture models, however, it is assumed that the type of the distribution is known, and the parameters to be computed depend on that type. When recent studies [2,3] on mixture models are examined, it can be seen that the datasets used are often not observable in 2D or 3D space [4].
This complicates preprocessing steps such as forming a prior expectation of the data or removing outliers before the EM algorithm is applied. This study therefore aims to estimate the unknown parameters of data that are assumed to be composed of more than one distribution and that can be described by two features. We present an application of the EM algorithm to estimate the parameters of a mixture model, and the accuracy of the estimated parameters is evaluated. Throughout, the datasets are treated as mixture models, and all distributions in the training dataset are Gaussian. The mean and covariance of each distribution are obtained via a variable-step learning algorithm and the predicted parameters. In the evaluations, the estimated parameters and the distributions they generate are compared with the original distributions. Several classifiers are then applied to two different datasets. To evaluate classifier accuracy, the classifiers are applied to the GMMs and compared with the help of the estimated parameters and prior probabilities. Moreover, the contributions of the training data and the test data to classifier performance are determined.

Mixture Model
A mixture model (MM) is a fundamental parametric density estimation scheme based on assuming some type of distribution, which can be Gaussian, exponential, Rayleigh, etc., or a combination of them. For the most part, this model is used when there is no information about the type of the distribution or when the data are an association of multiple distributions [5]. MMs are commonly used in signal and video processing applications, especially when the data are observable or expressible in 2D space. The main goal of the method is density function estimation, which is commonly used for classifier design and pattern recognition. In general, a probability density function (pdf) whose components each have two parameters can be expressed by (1),

p(x_i | φ) = Σ_{k=1..K} α_k p_k(x_i | z_k, θ_k), (1)

where K is the number of component distributions in the dataset, φ collects the two parameters, and z_k and θ_k denote symbolic parameters. In (1), w_ik corresponds to the membership weight/probability, which can be expressed as

w_ik = α_k p_k(x_i | z_k, θ_k) / Σ_{m=1..K} α_m p_m(x_i | z_m, θ_m), (2)

where i = 1,2,…,N, N represents the number of feature vectors in the mixture model, and α_m denotes the prior probability of each distribution. The value of N does not have a significant impact on the parameter estimation process. During classifier design, however, it becomes very important: if more data are used to train the classifier, its accuracy increases.
On the other hand, an increase in N increases the computational load of the algorithm. The membership weight vector specifies, under the Bayes rule, how strongly each feature vector is associated with each component. It should be noted that, since the membership weights of a feature vector represent a probability distribution over the components, they sum to one.
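The membership weights in (2) can be sketched numerically as follows. This is a minimal NumPy illustration, not the implementation used in the paper; the function names are our own, and the component densities are assumed Gaussian as in the rest of the study.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density evaluated at each row of x."""
    d = mean.size
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    expo = -0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff)
    return norm * np.exp(expo)

def membership_weights(x, alphas, means, covs):
    """w[i, k] = alpha_k p_k(x_i) / sum_m alpha_m p_m(x_i), as in eq. (2)."""
    K = len(alphas)
    num = np.column_stack(
        [alphas[k] * gaussian_pdf(x, means[k], covs[k]) for k in range(K)])
    return num / num.sum(axis=1, keepdims=True)
```

Each row of the returned matrix sums to one, reflecting the remark above that the membership weights of a feature vector form a probability distribution over the components.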

Expectation-Maximization Algorithm
The EM algorithm is a parametric density estimation technique: an iterative method for finding maximum likelihood (ML) or maximum a posteriori (MAP) estimates of parameters. The algorithm consists of four steps. The initialization step can be carried out in two ways: determining the initial parameters directly, or using prior probabilities to compute a random E-step for each class [6][7][8]. It should be noted that if there is any chance of pre-reviewing the data, the parameters can be selected with a sense of proportion instead of at random. If the initially selected values diverge from the actual values, the processing load and the number of iterations of the algorithm will increase.
In the E-step, z_k and θ_k, which in this study correspond to the mean and covariance matrices respectively, are given or randomly selected, and the membership weights are computed as in (3). The M-step focuses on re-estimating the parameters. To estimate the prior probability of each class, the effective number of members is derived from (3) at each iteration as

N_k = Σ_{i=1..N} w_ik, (4)

where k = 1,2,…,K, i = 1,2,…,N, and K is the number of classes. Here N_k denotes the effective number of feature vectors in class k. The updated prior probabilities of the classes can then be expressed as

α_k = N_k / N. (5)

After these derivations, the maximum likelihood estimates of the Gaussian parameters can be found. Assuming x = {x_1, x_2, …, x_N} is drawn from a Gaussian distribution, the joint pdf of the observed data can be written as the product in (6), and the multivariate Gaussian pdf becomes

p(x | μ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp( −(1/2)(x − μ)^T Σ^(−1) (x − μ) ), (7)

where d denotes the dimension of the covariance matrix. To compute the maximum likelihood estimate of the mean vectors over multiple observations, the log-likelihood in (8) is used.
Taking the derivative of (8) with respect to μ, weighting by the membership weights, and setting it to zero, we obtain

μ_k = (1/N_k) Σ_{i=1..N} w_ik x_i. (9)

To compute the maximum likelihood estimate of the covariance matrix, the log-likelihood is rewritten using the trace trick in (10). Taking the derivative of (10) with respect to Σ^(−1), again taking the membership weights into account, and setting it to zero, we get the covariance update

Σ_k = (1/N_k) Σ_{i=1..N} w_ik (x_i − μ_k)(x_i − μ_k)^T.

At the end of the M-step, the algorithm returns to the E-step and recalculates the parameters at each iteration. In this way, the prior probabilities and parameters converge toward their actual values with each iteration.
To terminate the algorithm, the estimated parameters should be very close to their actual values. To check this, the likelihood l(φ) and its logarithm are defined in (14) and (15), respectively:

l(φ) = Π_{i=1..N} p(x_i | φ), (14)

log l(φ) = Σ_{i=1..N} log Σ_{k=1..K} α_k p_k(x_i | z_k, θ_k). (15)

Note that the logarithm is monotonically increasing, so maximizing the logarithmic function is equivalent to maximizing the original one. The main purposes here are to observe the variation in (14) and to decide when the algorithm should be terminated. Finally, a stopping criterion must be defined: the algorithm is assumed to have converged when there is no significant change in log l(φ). If the changes are minor and the obtained results are similar, the iteration stops and the algorithm terminates. The parameters are thus estimated through the E-step, the M-step, and their iterations.
Note that the threshold on the change of (15) should be selected carefully. If it is chosen too large, the algorithm terminates early and the estimated parameters remain farther from their actual values. If it is chosen too small, the algorithm becomes more demanding: computing the change takes time, and the number of iterations increases. It is therefore recommended that this value be chosen according to the requirements of the application.
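The E-step, M-step, and stopping criterion described above can be sketched as a single loop. This is a minimal NumPy illustration under our own assumptions (Gaussian components, a small regularization term added to the covariances for numerical stability, and initial means spread along the first feature); it is not the authors' implementation.

```python
import numpy as np

def _gauss(x, mean, cov):
    """Multivariate Gaussian density, eq. (7), at each row of x."""
    d = mean.size
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff))

def em_gmm(x, K, n_iter=200, tol=1e-6):
    """EM for a K-component GMM; stops when the change in log l(phi),
    eq. (15), falls below tol."""
    n, d = x.shape
    alphas = np.full(K, 1.0 / K)              # equal priors, as in the text
    # spread the initial means along the first feature (a simple heuristic)
    order = np.argsort(x[:, 0])
    means = x[order[np.linspace(0, n - 1, K).astype(int)]].copy()
    covs = np.array([np.cov(x.T) + 1e-6 * np.eye(d) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: membership weights w[i, k], eq. (2)/(3)
        dens = np.column_stack([alphas[k] * _gauss(x, means[k], covs[k])
                                for k in range(K)])
        ll = np.log(dens.sum(axis=1)).sum()   # log-likelihood, eq. (15)
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: priors (5), means (9), covariances
        Nk = w.sum(axis=0)
        alphas = Nk / n
        means = (w.T @ x) / Nk[:, None]
        for k in range(K):
            diff = x - means[k]
            covs[k] = ((w[:, k, None] * diff).T @ diff / Nk[k]
                       + 1e-6 * np.eye(d))
        if abs(ll - prev_ll) < tol:           # stopping criterion
            break
        prev_ll = ll
    return alphas, means, covs
```

On well-separated mixture data, this loop recovers the component priors, means, and covariances to within sampling error.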
In this study, the actual feature vectors are illustrated in Fig. 1. From this figure, it is apparent that the mixture data are composed of three different Gaussian distributions with different mean and covariance parameters. Before the algorithm is applied, the prior probabilities of the classes are assumed equal to each other, i.e., 0.33. The other parameters are initialized randomly. In each iteration, (15) is recomputed with the current maximum likelihood estimates until no significant change occurs in log l(φ). At the end of this operation, the estimated means and covariances are close to their actual values. The process is repeated for each distribution. New mixture data, generated randomly from the estimated means and covariances, are then redrawn and compared with the raw Gaussian mixture data. The results are illustrated in Fig. 2.

Class Separability Measures
Before classification, the separability of the classes should be explored. This analysis is called the Class Separability Measure (CSM). It checks the separability of each class against the other classes and produces a scalar value; if this value is large, the classes concerned are separable with good accuracy. As pointed out by the Bayes rule, the classification error probability depends on the log-ratio

D_12(x) = ln( p(x | ω_1) / p(x | ω_2) ). (17)

Equation (17) also gives useful information about the discriminatory capability of an adopted feature vector x and can be used as a measure of the underlying discriminating information of class ω_1 with respect to ω_2. Averaging (17) over the pdfs gives the divergence

d_12 = ∫ ( p(x | ω_1) − p(x | ω_2) ) ln( p(x | ω_1) / p(x | ω_2) ) dx. (18)

For multiclass problems, the divergence is computed for every class pair ω_i and ω_j, and the pairwise divergences are summed as in (19). In the case of Gaussian pdfs, the pairwise divergence becomes

d_ij = (1/2) trace( Σ_i^(−1) Σ_j + Σ_j^(−1) Σ_i − 2I ) + (1/2) (μ_i − μ_j)^T ( Σ_i^(−1) + Σ_j^(−1) ) (μ_i − μ_j). (20)

Finally, the average class separability, or average divergence, is computed by averaging the pairwise divergences as in (21), where c is the number of classes. The results obtained after applying the CSM analysis to the datasets are given in Table 1. For both dataset1 and dataset2, the best separable classes are clearly the first and the second, because the difference between their means is the greatest. Good separability requires that the means of the Gaussian distributions be far from each other; in addition, when the covariances of the classes are arbitrary (uncorrelated), the CSM values increase.
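The Gaussian pairwise divergence of (20) and its average over class pairs can be sketched as follows. This is a NumPy illustration under our own naming, with the average taken uniformly over the c(c−1)/2 class pairs.

```python
import numpy as np

def gaussian_divergence(mu_i, cov_i, mu_j, cov_j):
    """Symmetric divergence d_ij between two Gaussian classes, eq. (20)."""
    d = mu_i.size
    inv_i, inv_j = np.linalg.inv(cov_i), np.linalg.inv(cov_j)
    dm = mu_i - mu_j
    term_cov = 0.5 * np.trace(inv_i @ cov_j + inv_j @ cov_i - 2 * np.eye(d))
    term_mean = 0.5 * dm @ (inv_i + inv_j) @ dm
    return term_cov + term_mean

def average_divergence(mus, covs):
    """Average pairwise divergence over c classes (uniform over pairs)."""
    c = len(mus)
    total = sum(gaussian_divergence(mus[i], covs[i], mus[j], covs[j])
                for i in range(c) for j in range(i + 1, c))
    return 2 * total / (c * (c - 1))
```

Identical classes give zero divergence, and the divergence grows as the class means move apart, matching the observation above that well-separated means yield the best CSM values.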

Classification of Datasets
In this section, the classification methods used in this study, such as Minimum Risk Bayes Classification (MRBC) and the discriminant classifier (Linear Discriminant Analysis), are explained in detail. The performance of these classifiers is then analyzed on the 2D and 3D GMM datasets to which the EM algorithm was applied; the obtained results are shown in Table 2. For k-NN classification in particular, the effect of the parameter k on classifier performance is analyzed, with the obtained results shown in Table 3.

Minimum Risk Bayesian Classification
A Bayesian classifier is based on the Bayes rule; the main idea is that the role of a natural class is to predict the values of the features for members of that class. Basically, this classifier calculates the pdf of each class from the mean vectors and covariance matrices of the Gaussian distributions. The prior probabilities of the classes are also taken into consideration in this operation.
For the multi-class case, the total risk, or the cost associated with ω_k, is expressed as in (22), where k = 1,2,…,c and λ_ki is the cost function, i.e., the cost of assigning a feature vector to class ω_i when it actually belongs to class ω_k.
If the error costs of the classes concerned are equal, the cost function takes only the values 0 and 1 (λ_ki = 0 for k = i and 1 otherwise); this is called the hard decision function and is represented in (23).
When the error costs between classes are not equal, the cost function takes another form. A minimum error classification criterion is then defined, and the average risk is minimized with the help of r_k and the cost function λ_ki:

ℓ_i = Σ_{k=1..c} λ_ki Pr(ω_k | x). (24)

The aim of this process is to minimize ℓ_i and assign the vector x to the class that offers the minimum risk. To minimize ℓ_i, the posterior probability Pr(ω_i | x) should be maximized. The correct classification rule is expressed in (27); note that the assignment is made with respect to the minimum cost (error). For instance, in (27), the feature vector x is assigned to ω_c when its weighted posterior risk is smaller than that of every other class. In the special case where the prior probabilities of the classes are equal, the classification decision can be made by comparing the likelihoods Pr(x | ω_i).
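The minimum-risk rule in (24) can be sketched as below. This is a NumPy illustration under our own naming; `costs[k, i]` plays the role of λ_ki, and with the zero-one cost matrix of (23) the rule reduces to picking the maximum posterior.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density at a single point x."""
    d = mean.size
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ inv @ diff)

def min_risk_classify(x, priors, means, covs, costs):
    """Assign x to the class i minimizing the expected risk
    l_i = sum_k costs[k, i] * Pr(omega_k | x), as in eq. (24)."""
    c = len(priors)
    post = np.array([priors[k] * gaussian_pdf(x, means[k], covs[k])
                     for k in range(c)])
    post = post / post.sum()        # posteriors Pr(omega_k | x)
    risks = costs.T @ post          # risks[i] = sum_k costs[k, i] * post[k]
    return int(np.argmin(risks))
```

With asymmetric costs the decision boundary shifts away from the maximum-posterior rule: a point whose posterior favors one class may still be assigned to the other if misclassifying the latter is expensive.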

k-NN Classification
k-NN classification is one of the non-parametric density estimation methods. Unlike MRBC and other parametric methods, it does not rely on distribution parameters. The purpose of the method is to estimate the density function from the k nearest neighbours, as in (28),

p(x | ω_i) ≈ k_i / (N_i V), (28)

where i = 1,2,…,c, c denotes the number of classes, k_i is the number of the k nearest neighbours belonging to class i, V is the volume containing them, and N_i expresses the number of feature vectors in the i-th class. Here k is a constant (which should be chosen odd) giving the number of nearest neighbours to be processed. Note that different distance metrics cause changes in classification performance. Although different approaches can be used as distance metrics, such as the city-block (L1), sum of squared differences (SSD), or Minkowski (L∞) distances, the Euclidean distance (L2) is generally preferred by researchers [9].
The general form of the sum of squared differences (SSD) is expressed in (29),

SSD(x, y) = Σ_j (x_j − y_j)^2, (29)

and the Euclidean distance can be denoted as

L2(x, y) = sqrt( Σ_j (x_j − y_j)^2 ). (30)

The optimum classification decision then becomes (31), where m indexes the classes to which the k nearest neighbours belong. During classification, the distances between the vector x and its k nearest neighbours are calculated, and the class of each neighbour is determined. In the decision stage, x is assigned to the class C_opt that has the highest number of members among the k neighbours.
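The decision stage described above can be sketched as a short function: compute the L2 distances of (30), take the k closest training vectors, and vote as in (31). A minimal NumPy illustration under our own naming.

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_x, train_y, k=5):
    """Assign x to the class with the most members among its k nearest
    neighbours under the Euclidean distance of eq. (30)."""
    dists = np.linalg.norm(train_x - x, axis=1)   # L2 distance to each vector
    nearest = train_y[np.argsort(dists)[:k]]      # labels of the k closest
    return Counter(nearest.tolist()).most_common(1)[0][0]
```

Choosing k odd, as noted above, avoids ties in the two-class vote.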

Discriminant Classification (LDA)
This method consists of two major parts: projection of the data and determination of the decision boundaries. The main objective of the method is dimensionality reduction. In this classification scheme, the data are divided into sub-spaces or hyperplanes that carry the best class-discriminatory information, with the help of the produced projection vectors. The best classifier performance is then obtained at the optimal decision boundaries. During this process, the aim is to preserve as much class information as possible. For these reasons, this subsection explains which projection vectors are best for discriminant classifiers.
For the c-class case, the mean and covariance matrices are calculated while ignoring the membership weights, and the global mean is then computed as in (32). The projection vector w is found by maximizing the Fisher criterion

J(w) = (w^T S_B w) / (w^T S_W w), (35)

where S_B and S_W are the between-class and within-class scatter matrices. Taking the derivative of (35) with respect to w and setting it to zero leads, after some derivation steps, to the eigenvalue-eigenvector problem shown in (38),

S_W^(−1) S_B w = λ w, (38)

whose solutions give the projection vectors. The discriminant function of the i-th class is g_i(x) = w_i^T x + w_i0, where w_i is the weight vector and w_i0 denotes the bias or threshold of the i-th class; after some simplifications, the coefficient of the quadratic term and the weight vector are obtained. While calculating the decision boundaries, the hyperplanes g_i(x) − g_2(x) = 0 are plotted for each class, under the assumption that all the data consist of two classes; that is, g_2(x) is computed from all the data except the i-th class. As a result, the decision boundaries are calculated for dataset1, and the obtained results are illustrated in Fig. 9.
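For the two-class case, the eigenvalue problem in (38) has the well-known closed-form solution w ∝ S_W^(−1)(μ_1 − μ_2), which can be sketched as follows. A NumPy illustration under our own naming; not the authors' implementation.

```python
import numpy as np

def fisher_lda(x1, x2):
    """Two-class Fisher projection w = S_w^{-1} (mu1 - mu2), the closed-form
    solution of the eigenvalue problem in eq. (38) for c = 2 classes."""
    mu1, mu2 = x1.mean(axis=0), x2.mean(axis=0)
    # within-class scatter: sum of per-class scatter matrices
    sw = (np.cov(x1.T, bias=True) * len(x1)
          + np.cov(x2.T, bias=True) * len(x2))
    w = np.linalg.solve(sw, mu1 - mu2)
    return w / np.linalg.norm(w)    # unit-norm projection direction
```

When the classes differ mainly along one feature, the recovered direction aligns with that feature, which is exactly the class-discriminatory information the projection is meant to preserve.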

Conclusion
In this paper, both dataset1 and dataset2 are modelled with GMMs, and the EM algorithm is applied to predict the distribution parameters as well as to estimate the number of feature vectors in each class. The performances of the classifiers are then discussed and compared. In the CSM analysis of both datasets, the best separable classes are found to be class-1 and class-2, since they have arbitrary covariance values and their means are farthest from each other. In addition, to investigate classifier performance further, the k-NN classification method is applied; it is observed that k-NN classification reaches its maximum accuracy at k = 9 and k = 5 for dataset1 and dataset2, respectively.
Consequently, according to all evaluation results, the best overall accuracy is obtained with k-NN classification for dataset2, and the best accuracy for dataset1 is achieved with SVM classification.
Furthermore, it is also determined that the classification performance decreases as the dimension of the space increases. Finally, when the effect of the amount of training data on classifier performance is investigated, it is observed that, regardless of the classifier type, the performance of all classifiers decreases as the amount of training data increases.