Research on Brain Signals via Artificial Neural Network and Swarm Intelligence Algorithms

Artificial Neural Networks (ANNs), which can learn from their environment in order to improve their performance, are widely used in numerous applications. The Backpropagation (BP) algorithm is one of the most popular and effective training methods for ANNs. However, because it uses gradient descent, which attempts to minimize the network error by moving along the gradient of the error curve, it easily gets trapped in local minima. To avoid this problem and to obtain a better classifier, we propose a combined ANN and Swarm Intelligence (SI) method in which the Artificial Bee Colony (ABC) and Particle Swarm Optimization (PSO) algorithms are applied to the Multilayer Perceptron Neural Network (MLPNN). Two Electroencephalogram (EEG) datasets were used to test the accuracy and success of the study. Compared to the conventional-MLPNN, higher success values were obtained on each dataset with the proposed methods. Experimental results demonstrate that combining SI with the MLPNN increases the success of the BP algorithm by avoiding local minima. The ABC-MLPNN and PSO-MLPNN methods achieved 79.00% and 75.50%, respectively, on the Boston data and 91.67% and 88.33%, respectively, on the Selcuk data. With the MLPNN algorithm alone, the success rate was 68.50% for the Boston data and 81.67% for the Selcuk data. These results show that the success of the MLPNN algorithm increases significantly with the weights obtained by using SI. In addition, this study shows that the SI-MLPNN algorithm can be used on non-linear and highly complex EEG data.


Introduction
Artificial Neural Networks (ANNs) can effectively decide the class of a signal. Neural networks have therefore been applied successfully in many medical applications [1]. Spectral analysis is a well-known method for analyzing EEG signals (EEGs). ANNs may now offer superior performance for the analysis of EEGs compared to spectral analysis methods [2]. On the other hand, the Backpropagation (BP) algorithm, a training technique for ANNs, is prone to local minima because it uses gradient information. To avoid this problem, a host of other algorithms have been explored for Multilayer Perceptron Neural Network (MLPNN) training. Swarm Intelligence (SI) algorithms inspired by nature are one such alternative. SI-based techniques can be used in a number of applications such as controlling unmanned vehicles, self-assembly and interferometry, planetary mapping, controlling nanobots within the body, killing cancer tumors and data mining. Artificial Bee Colony (ABC) and Particle Swarm Optimization (PSO) are two of the SI algorithms that have been used for training ANNs. However, there is currently no intelligent theory based on the complexity of the disease being diagnosed. To date, no study has been reported in the literature on SI-based MLPNN classification for the analysis of EEGs. EEG signals are non-linear signals that are quite difficult and complex to interpret in biomedical engineering. Another important contribution of this study is that the feature vector, which is extracted by different extraction methods, is reduced by the eigenvector method to provide a faster and more robust structure. The rest of this paper is organized as follows: Section 2 gives a brief overview of the related materials and methods. The proposed method is explained in Section 3, followed by the experiments performed and the experimental results obtained for the proposed methods in Section 4. 
Finally, we summarize the most relevant conclusions and discussion of this work in Section 5.
The first stage of signal processing is pre-processing, which involves transforming raw data into an understandable format. This process, which is commonly applied in advance in data mining, transforms data into a format that can be processed more easily and effectively. Pre-processing usually includes sampling, denoising, normalization, filtering, artifact rejection, etc. Feature extraction and selection play a very important role in classification methods. Feature extraction is the determination of a feature or a feature vector from a pattern vector. For pattern processing problems to be tractable, patterns must be converted to features, which are condensed representations of patterns, ideally containing only salient information [3]. In the feature extraction stage, numerous different methods can be used so that several diverse features can be extracted from the same raw data. Feature selection methods provide a way of reducing computation time, improving prediction performance, and gaining a better understanding of the data in machine learning or pattern recognition applications [4]. The focus of feature selection is to select a subset of variables from the input which can efficiently describe the input data while reducing effects from noise or irrelevant variables and still provide good prediction results [5]. The last step in signal processing, classification, checks the identity of the input vectors against the feature vectors stored in the database. Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. The simplest type of classification problem is binary classification, in which the target attribute has only two possible values. 
In the model build (training) process, a classification algorithm finds relationships between the values of the predictors and the values of the target. Different classification algorithms use different techniques for finding relationships. These relationships are summarized in a model, which can then be applied to a different data set in which the class assignments are unknown [6].

Electroencephalogram
Epilepsy is a neurological disorder that stems from temporary abnormal discharges in the brain's electrical activity [7], [8]. The crucial characteristic of this disease is repetitive seizures. These seizures may sometimes go unnoticed [9]. The Electroencephalogram (EEG) is a clinical monitoring tool that records the electrical activity of the brain; these signals contain valuable information for understanding epilepsy. The detection of seizures occurring in EEGs is an important component in the diagnosis and treatment of epilepsy. In general, long-term EEG monitoring records EEG for periods longer than routine EEG recordings of 20 to 30 minutes. However, interpretation of EEGs is a time-consuming and expensive process because it involves large amounts of data. EEG monitoring systems generate large amounts of data on electroencephalographic changes, and their complete visual analysis is not routinely possible. Prolonged EEG that lasts hours to days creates a large volume of EEG for interpretation. Computer-assisted analysis has become widely available to allow review of selected EEG during specific times of interest [10]. Therefore, developing automatic seizure detection methods is of great significance for reviewing EEGs [11]. In the last two decades, many researchers have addressed this problem. Fu [12] classified seizures based on the time-frequency imaging of EEGs using the Hilbert-Huang Transform (HHT) and a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel. Joshi [13] carried out classification of EEGs using fractional linear prediction. A comparative study of wavelet families for EEG classification was performed by Gandhi [14] using a Probabilistic Neural Network (PNN) with SVM. 
Lee [15] proposed a Neural Network with Weighted Fuzzy Membership (NEWFM) to classify EEGs, while Aydın [16] developed a classification using a Multilayer Neural Network (MLNN) architecture with respect to several time-domain entropy measures on EEG series. A hybrid technique integrating PSO with NEWFM was proposed for epileptic seizure classification tasks by Abuhasel [17]. The work of Satapathy [18] analyzes epileptic disorder in the human brain through EEG analysis by integrating the best attributes of ABC and radial basis function neural networks (RBFNNs). Dehuri [19] classified several UCI Machine Learning Repository datasets, including Fisher iris, Pima diabetes and shuttle, using an ABC-trained MLPNN.
Ground activity is well developed in the posterior regions of the hemispheres and contains 8-13 Hz alpha waves, as shown in Figure 2. The neurologist's interpretation in the clinical information was that there was no significant asymmetry between the hemispheres. Furthermore, slow-wave paroxysms concentrated in the central areas were observed at infrequent intervals along the trace.

EEG Dataset
To investigate the classification accuracy of the proposed methods, datasets from Boston Children's Hospital and Selcuk University Medical Faculty Hospital were used. The first dataset, collected at Boston Children's Hospital, consists of EEG recordings from pediatric subjects with intractable seizures. Subjects were monitored for up to several days following withdrawal of anti-seizure medication in order to characterize their seizures and assess their candidacy for surgical intervention. Recordings, grouped into 23 cases, were collected from 22 subjects. All signals were sampled at 256 samples per second with 16-bit resolution. Two hundred recordings were included in this study, the first 100 of which belonged to the 22 seizure and non-epilepsy subjects at Boston Children's Hospital [20]. The second dataset was obtained retrospectively from the Department of Neurology of Selcuk University Hospital. Surface EEGs were recorded from 60 patients using the international 10-20 system of electrode placement. The study used recordings belonging to 60 patients (

Discrete Wavelet Transform
Recently, many nonlinear and nonstationary methods [21], [13] have been suggested to extract signal processing parameters. The wavelet transform (WT) has been found to be particularly useful for analyzing signals that can best be described as aperiodic, noisy, intermittent, transient and so on. Its ability to examine the signal simultaneously in both time and frequency, in a distinctly different way from the traditional Short-Time Fourier Transform (STFT), has spawned an ever-increasing number of sophisticated wavelet-based methods for signal manipulation and interrogation. Wavelets are used to transform the signal under investigation into another representation which presents the signal information in a more useful form [22]. The main advantage of the WT is that it has a varying window size, being broad at low frequencies and narrow at high frequencies, thus leading to an optimal time-frequency resolution in all frequency ranges [23]-[28]. The discrete wavelet transform (DWT) is generally used because calculating wavelet coefficients at every possible scale requires a great deal of effort and can result in a large amount of data [8]. Wavelets provide time-scale information about a signal, enabling the extraction of features that vary in time [29]. The DWT analyzes the signal at different frequency bands and different resolutions by decomposing the signal into a coarse approximation and detailed information. The DWT employs two sets of functions, called scaling functions and wavelet functions, which are associated with low-pass and high-pass filters, respectively. The decomposition of the signal into different frequency bands is obtained simply by successive high-pass and low-pass filtering of the time-domain signal [30].
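The successive low-pass and high-pass filtering described above can be illustrated with a minimal sketch. It uses the Haar wavelet purely for brevity, which is an assumption made for illustration and not the fourth-order Daubechies wavelet used later in this study; the function names `haar_step`, `dwt` and `band_edges` are hypothetical. The band edges are the nominal ones implied by halving the spectrum at each level for a given sampling rate, such as the 256 samples per second of the Boston recordings.

```python
import math


def haar_step(signal):
    """One DWT level: low-pass (approximation) and high-pass (detail) halves."""
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * s for i in pairs]
    return approx, detail


def dwt(signal, levels):
    """Return [A_levels, D_levels, ..., D_1] by repeatedly splitting the approximation."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return [approx] + details[::-1]


def band_edges(fs, levels):
    """Nominal frequency band in Hz of each sub-band for sampling rate fs."""
    bands = {f"A{levels}": (0.0, fs / 2 ** (levels + 1))}
    for k in range(1, levels + 1):
        bands[f"D{k}"] = (fs / 2 ** (k + 1), fs / 2 ** k)
    return bands


coeffs = dwt([1.0] * 64, levels=5)   # constant toy signal, not EEG data
bands = band_edges(256, levels=5)    # A5: 0-4 Hz, D5: 4-8 Hz, D4: 8-16 Hz, D3: 16-32 Hz
```

With a 256 Hz sampling rate and five levels, the nominal bands of A5, D5, D4 and D3 line up with the delta, theta, alpha and beta ranges commonly used in EEG analysis.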

Statistical Feature Extraction
Feature extraction plays an important role in pulling out special patterns (features) from the original data for reliable classification. The feature extraction stage must reduce the original data to lower dimensions that contain most of the useful information included in the original vector. It is therefore necessary to find out the key features that represent the whole dataset, depending on its characteristics [31]. Some statistical features are extracted from the data of each channel as the most representative values to describe the original signals. The following ten statistical features of each channel of EEG data are used as valuable parameters in the representation of the characteristics of the original EEGs.

Principal Component Analysis
Principal Component Analysis (PCA) is one of the best-known methods for dimension reduction. The goal of PCA is to reduce the dimensionality of the data while retaining as much as possible of the variation present in the original dataset [32]. PCA transforms a high-dimensional dataset (of m dimensions) to a low-dimensional orthogonal feature (eigenvector) space (of n dimensions, m > n) while retaining the maximum variance of the original high-dimensional dataset. Each resulting orthogonal feature is referred to as a Principal Component (PC). Eigenvalues are scalar representations of the degree of variance within the corresponding PCs. PCs are ranked by their corresponding eigenvalues, and thus the first PC captures the most significant variance in the dataset. The second PC is perpendicular to the first PC and contains the next most significant variance [33]. PCA is mostly useful for segmenting signals from multiple sources such as EEGs. Knowing the number of independent components in advance is very useful. Therefore, we chose PCA, one of the best dimension reduction methods, to reduce the number of features in the second stage of signal processing.
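As a rough illustration of how the first PC and its eigenvalue relate to the covariance of the features, the sketch below uses power iteration on the sample covariance matrix. Power iteration is chosen here only because it fits in a few lines; it is an assumption for illustration and stands in for the full eigendecomposition that a PCA routine would normally perform. The function names and the toy data are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample covariance of a list of equal-length feature rows, plus centered data."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    return cov, centered


def first_principal_component(cov, iterations=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    m = len(cov)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the variance captured along v.
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(m)) for a in range(m))
    return v, eigenvalue


# Toy two-feature data lying on the line y = 2x (not EEG features).
rows = [[x, 2.0 * x] for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
cov, centered = covariance_matrix(rows)
pc1, variance = first_principal_component(cov)
scores = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]
```

Here all of the variance lies along a single direction, so projecting onto the first PC reduces two features to one without losing information; real feature vectors would need several PCs.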

Artificial Neural Network
An ANN is an information-processing system based on a simulation of the human cognitive process. In ANNs, knowledge about the problem is distributed through the connection weights of the links between neurons, as shown in Figure 3. The neural network has to be trained to adjust the connection weights and biases in order to produce the desired mapping.
ANNs are widely used in the biomedical field for modeling, data analysis and diagnostic recognition. Their capability to learn from examples, their ability to reproduce arbitrary non-linear functions of the input, and their highly parallel and regular structure make them especially suitable for pattern recognition problems [35]. The multilayer perceptron neural network (MLPNN) is the most commonly used feedforward neural network due to its fast operation, ease of implementation, and smaller training set requirements. The MLPNN consists of three sequential layers: input, hidden and output layers, as shown in Figure 4.

Fig.4. Multilayer Artificial Neural Network
The hidden layer processes the input information and transmits it to the output layer [36]. The training algorithm is an important part of the ANN model. A good topology can be inefficient if trained by an inappropriate algorithm. A suitable training algorithm has a short training process while achieving better accuracy. Many training algorithms are used to train the MLPNN; one of the most commonly used is Bayesian regularization BP, which is also used in this work. This algorithm updates the weight and bias values according to Levenberg-Marquardt optimization. It minimizes a combination of squared errors and weights, and then determines the correct combination to produce a network that generalizes well. This process is called Bayesian regularization [37].
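The flow of information through the three layers can be sketched as a single forward pass. The sketch assumes sigmoid hidden units and a linear output unit, a common MLP configuration; the function names and the weight values are hypothetical and chosen only so the result is easy to check by hand.

```python
import math


def sigmoid(z):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: sigmoid hidden layer followed by a linear output layer."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w_out, b_out)]


# Two inputs, two hidden neurons, one output (illustrative weights only).
output = forward(
    x=[0.0, 0.0],
    w_hidden=[[1.0, 1.0], [1.0, -1.0]],
    b_hidden=[0.0, 0.0],
    w_out=[[1.0, 1.0]],
    b_out=[0.5],
)
```

Training, by whichever algorithm, amounts to choosing the entries of `w_hidden`, `b_hidden`, `w_out` and `b_out` so that the outputs match the desired targets.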

Calculation of classification performance
The most straightforward way to evaluate the performance of classifiers is based on confusion matrix analysis [38]. A confusion matrix contains information about the actual and predicted classifications made by a classification system. The performance of such a system is commonly evaluated using the data in the matrix [39]. Given a classifier and an instance, there are four possible outcomes. If the instance is positive and it is classified as positive, it is counted as a true positive (TP); if it is classified as negative, it is counted as a false negative (FN). If the instance is negative and it is classified as negative, it is counted as a true negative (TN); if it is classified as positive, it is counted as a false positive (FP) [40]. The proposed methods are evaluated in classification problems by computing statistical parameters. Sensitivity (SEN), specificity (SPE) and classification accuracy (CA) values are calculated as in Eqs. 1, 2 and 3:

$$\mathrm{SEN} = \frac{TP}{TP + FN} \qquad (1)$$

$$\mathrm{SPE} = \frac{TN}{TN + FP} \qquad (2)$$

$$\mathrm{CA} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3)$$
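The three measures follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions of sensitivity, specificity and accuracy; the function names and the toy labels are hypothetical, with 1 standing for the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN and FN for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1          # positive instance classified as negative
        elif p == positive:
            fp += 1          # negative instance classified as positive
        else:
            tn += 1
    return tp, fp, tn, fn


def sen_spe_ca(tp, fp, tn, fn):
    """Sensitivity, specificity and classification accuracy."""
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    ca = (tp + tn) / (tp + fp + tn + fn)
    return sen, spe, ca


# Toy labels, not results from this study.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
counts = confusion_counts(actual, predicted)
sen, spe, ca = sen_spe_ca(*counts)
```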

Proposed Methods
SI is the name given to a relatively new interdisciplinary field of research which has gained wide popularity in recent times. Algorithms belonging to this field draw inspiration from the collective intelligence emerging from the behavior of a group of social insects (like bees, termites and wasps). Even with very limited individual capability, these insects can jointly (cooperatively) perform many complex tasks necessary for their survival [41]. Ant Miner, ABC and PSO are SI methods that are frequently used in classification problems and have produced successful results [19].

Artificial Bee Colony
The colony of artificial bees consists of three groups of bees: employed bees, onlookers, and scouts. The employed bees are those which randomly search for food-source positions. Onlookers are those bees waiting in the hive's dance area. In the ABC algorithm, onlookers and employed bees perform the exploitation process in the search space, while scouts control the exploration process [42]. In ABC, every bee explores a possible solution, in the current case the optimum weight matrix for the given network configuration. The training error err(x) is used as the fitness value; it indicates the extent to which the network output conforms to the actual output. Minimizing the error (fitness value) leads to the best set of weights for the given network configuration, and the network is then said to be trained. The performance of any classifier depends on the chosen loss function. Selecting a proper loss function err(x) for a given problem is often difficult. Different classification techniques in machine learning employ different loss functions to obtain better classification accuracy. Neural network classifiers commonly employ mean squared error (MSE) minimization or cross entropy. In our study, we use the root mean sum of squared residuals (errors) on the training data as the fitness value of the ABC. This serves as a qualitative performance measure of the network learning and is given in Eq. 4 [19].
where yi is the activity level of the i th node in the top layer and di is the desired output of the i th node. Our objective is to minimize this fitness value. At each time step, the randomness amplitude and speed of convergence of each bee are changed as it moves towards its food source. The random factor prevents the swarm from getting stuck in the wrong place, and the speed of convergence determines the rate at which bees converge to a solution. Training basically involves presenting the training samples as input vectors to a neural network, calculating the error of the output layer, and then adjusting the weights of the network to minimize the error [19].
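The interplay of the three bee groups can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this study: `rmse` is one plausible reading of the fitness described above (a root mean squared error over the output nodes), the parameter values (`n_sources`, `limit`, `cycles`, `bound`) are arbitrary, every exhausted source is replaced by a scout in each cycle, and the objective minimized in the example is a toy function rather than a network training error.

```python
import math
import random


def rmse(outputs, targets):
    """Root mean squared error between outputs y_i and desired outputs d_i."""
    return math.sqrt(sum((y - d) ** 2 for y, d in zip(outputs, targets)) / len(targets))


def abc_minimize(err, dim, n_sources=10, limit=20, cycles=200, bound=5.0, seed=0):
    """Minimize a non-negative error function err over dim-dimensional vectors."""
    rng = random.Random(seed)
    sources = [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(n_sources)]
    costs = [err(s) for s in sources]
    trials = [0] * n_sources
    best = {"x": None, "cost": float("inf")}

    def record(i):
        # Remember the best food source seen so far.
        if costs[i] < best["cost"]:
            best["x"], best["cost"] = list(sources[i]), costs[i]

    def try_neighbour(i):
        # Perturb one coordinate relative to a randomly chosen other source.
        k = rng.choice([j for j in range(n_sources) if j != i])
        d = rng.randrange(dim)
        cand = list(sources[i])
        cand[d] += rng.uniform(-1.0, 1.0) * (sources[i][d] - sources[k][d])
        c = err(cand)
        if c < costs[i]:                      # greedy selection
            sources[i], costs[i], trials[i] = cand, c, 0
            record(i)
        else:
            trials[i] += 1

    for i in range(n_sources):
        record(i)
    for _ in range(cycles):
        for i in range(n_sources):            # employed bees
            try_neighbour(i)
        weights = [1.0 / (1.0 + c) for c in costs]   # onlookers favour low error
        for i in rng.choices(range(n_sources), weights=weights, k=n_sources):
            try_neighbour(i)
        for i in range(n_sources):            # scouts replace exhausted sources
            if trials[i] > limit:
                sources[i] = [rng.uniform(-bound, bound) for _ in range(dim)]
                costs[i], trials[i] = err(sources[i]), 0
                record(i)
    return best["x"], best["cost"]


def toy_error(w):
    """Toy objective: Euclidean norm, smallest at the origin."""
    return math.sqrt(sum(x * x for x in w))


best_w, best_err = abc_minimize(toy_error, dim=2)
```

To train a network in this way, `err` would map a flattened weight vector to the training error of the network built from those weights, and `dim` would equal the total number of weights and biases.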

Particle Swarm Optimization
PSO is an evolutionary optimization algorithm proposed by Kennedy and Eberhart in the mid-1990s that attempts to simulate the movements and choreography of birds. PSO can be incorporated into classification methods because it is robust and adaptable. In addition, the modified PSO with inertia weight w tries to stay away from local minima [43]. The inertia weight is an important parameter in PSO, which significantly affects convergence and the exploration-exploitation trade-off in the PSO process [44]. Appropriately chosen large and small values of the inertia weight help avoid local minima [45].
Firstly, the total weighted input xj is calculated using Eqs. 5, 6, and 7 for the PSO-MLPNN method. X indicates the particle position and V indicates the particle velocity. Pi and Pg are the particle best (pbest) and global best (gbest), respectively. The index i denotes the particle, and t is the time step. Rand() denotes a normally distributed one-dimensional random number with a mean of zero and a standard deviation of one. Parameters c1 and c2 are the cognitive and social learning rates [46], and w is the inertia weight suggested by [43]. In Eq. 7, yi is the activity level of the i th node in the previous layer and wij is the weight of the connection between the i th and j th nodes. Secondly, the node computes the activity yj using some function of the total weighted input. In this work, we used the sigmoid function since its error rate was lower than that of the others.
The error (E) is computed once the activities of all output nodes have been determined; it is defined in terms of yi, the activity level of the i th node in the top layer, and di, the desired output of the i th node. In our study, the MLPNN and SI methods were combined to prevent the MLPNN from becoming trapped in local minima and to ensure high classification success. For the proposed ABC-MLPNN and PSO-MLPNN methods, after the initial weights were assigned randomly, activation function values were calculated, and training of the network continued using BP with the new weights until the stop criteria were met. The number of input values and the number of hidden layer neurons of the network (consisting of one hidden layer) determine the dimension (N) of the ABC and PSO. The block diagram of the MLPNN method combined with SI is shown in Figure 5. The weights obtained by using SI between the input and hidden layers and between the hidden and output layers are represented by WSI. A sigmoid function was used as the activation function between the input and hidden layers, and a linear function was used as the activation function between the hidden and output layers.
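The position and velocity updates described in this section can be sketched in their widely used inertia-weight form. Several points are assumptions made for illustration: the sketch draws uniform random numbers in [0, 1), the common choice, rather than the normally distributed Rand() described in the text; the values of `w`, `c1` and `c2` are typical defaults, not those of this study; and the objective is a toy function rather than a network training error.

```python
import random


def pso_minimize(err, dim, n_particles=15, steps=200, w=0.7, c1=1.5, c2=1.5,
                 bound=5.0, seed=0):
    """Minimize err with inertia-weight PSO; returns (gbest position, gbest error)."""
    rng = random.Random(seed)
    X = [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    P = [list(x) for x in X]                     # pbest positions
    p_cost = [err(x) for x in X]
    g = min(range(n_particles), key=lambda i: p_cost[i])
    Pg, g_cost = list(P[g]), p_cost[g]           # gbest position and error
    for _ in range(steps):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia term + cognitive pull to pbest + social pull to gbest.
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (P[i][d] - X[i][d])
                           + c2 * r2 * (Pg[d] - X[i][d]))
                X[i][d] += V[i][d]
            cost = err(X[i])
            if cost < p_cost[i]:
                P[i], p_cost[i] = list(X[i]), cost
                if cost < g_cost:
                    Pg, g_cost = list(X[i]), cost
    return Pg, g_cost


def sphere(x):
    """Toy objective: sum of squares, smallest at the origin."""
    return sum(v * v for v in x)


gbest, gbest_err = pso_minimize(sphere, dim=3)
```

For network training, each particle position would hold one complete set of weights, so `dim` would be fixed by the numbers of inputs and hidden neurons, and `err` would return the training error for those weights.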

Experimental Results
The success of the combined proposed methods was compared with the performance of a conventional-MLPNN method by using two different EEG datasets, one of which was publicly available (Boston Children's Hospital) while the other (Selcuk University Hospital) was obtained retrospectively. A ten-fold cross-validation technique was used for the best model selection in the defined classification problems [47]-[49], ten-fold cross-validation being the most common choice in data mining and machine learning. As a pre-processing step before training, all dataset values were normalized using the minimum and maximum values of the data, where xs is the value of the s th segment to be normalized, and xmax and xmin are the maximum and minimum values of the data. Meanwhile, the target values are kept as 1 and 0, which represent epileptic activities and non-epilepsy EEGs, respectively. The present model consists of the following three steps, as mentioned in the Materials and Methods section: 1. Feature Extraction: The EEGs, consisting of many data points, can be compressed into a few features by performing spectral analysis of the signals with the WT. These features characterize the behavior of the EEGs.
Using a smaller number of features to represent the EEGs is particularly important for recognition and diagnostic purposes [25], [26], [50]. DWT is one of the nonstationary methods used for extracting features. Choosing a suitable wavelet and number of decomposition levels is very significant for signal analysis. After the datasets have been arranged, feature extraction must be performed so that the same dataset can be identified with fewer features. The obtained feature vector can sometimes be used directly, or a feature selection process can be carried out to decrease the number of features. Firstly, the fourth-order Daubechies wavelet, which has proven to be the most appropriate wavelet function for epileptic EEG analysis, was chosen as the wavelet function in this study [51]. The number of decomposition levels must be chosen according to the effective frequency components of the signal [52]. The fourth-order Daubechies wavelet was used to decompose the signals into sub-bands in the time domain using five levels. EEG recordings were divided into sub-band frequencies such as delta (δ), theta (θ), alpha (α) and beta (β) by using DWT. Then a set of power features in the time domain was extracted from the wavelet sub-band frequencies δ (0-4 Hz), θ (4-8 Hz), α (8-16 Hz) and β (16-32 Hz). Statistical features are indicative factors in determining the qualities of signals. Thus, a second feature vector was obtained by extracting 10 different statistical features from all channels of each dataset. Then, the DWT and statistical features were combined and a single feature vector was constructed. All feature vectors were computed using the MATLAB (Version 7.11, R2010b) software package to achieve results faster and more accurately. 2. Feature Selection: Selection of the proposed model inputs is very important for the success of the classifier [21]. 
In that context, if the number of inputs is unnecessarily high, the performance of the system might decrease because of the difficulty of the calculation on the network. However, if the number of inputs is unnecessarily low, the system may not give accurate results. Therefore, the choice of the number of inputs is very significant for these systems. PCA is a sophisticated method that reduces the size of the feature set [53], and it was used for feature selection. The new feature vectors (eigenspace), which are given in the "Feature Selection" section of Table 1, were determined by using PCA. By considering the relationships between features through feature selection, the features with the most significant relationships create a new eigenvector. The eigenvectors presented in this table were used as feature vectors to determine the success of the classification methods. 3. Classification: The final step of signal processing is to determine the success of the system using classifiers. The BP algorithm has difficulties in handling local optima and cannot yield optimal adjustable weights for the MLPNN. For the MLPNN method, which has a single hidden layer, different numbers of neurons were tested and twelve neurons were chosen as the optimum number after these attempts [19]. The best weights and optimum number of neurons were investigated using a program written in the Delphi language. The main difference between the network types lies in the type of activation function used by the hidden neurons. In the MLPNN, a common type of activation function used by the hidden neurons is the sigmoid function. Neurons in the output layer usually have linear transfer functions [54]. Classification success was also investigated by using a sigmoid transfer function, as shown in Figure 6. Thus, high classification success was aimed for by choosing the best network structure. 
Linear and sigmoid functions were tried as the output function for some folds of the present datasets, and their performances were examined.

Fig.6. The activation functions (linear and sigmoid/logistic) used in proposed methods, with their mathematical equations and 2D graphical representations [55].
According to the data obtained, there was no remarkable difference between them; however, the results obtained with the linear transfer function were slightly better than those with the sigmoid function. The linear transfer function was therefore used for all other data. The simplest stop criterion for active learning is when the training set reaches the desired size or a predefined threshold. In this context, we examined each stop criterion for ending classification. In this work, the termination criterion was mostly reached at the maximum iteration for all classification methods. Afterwards, we separately calculated the SEN, SPE and CA values (Eqs. 1, 2 and 3), which are statistical measures of the performance of a binary classification test, for all methods. Classification successes for each method are shown in Tables 2, 3 and 4, respectively. The confusion matrix of Fold 1 for the Boston dataset obtained with ABC-MLPNN is given in Table 5. As seen in Table 6 and Figure 7, the ABC-MLPNN method shows the highest success rate (91.67%) on the Selcuk dataset, while the PSO-MLPNN and MLPNN methods show success rates of 88.33% and 81.67%, respectively, on the same dataset. Consequently, the CA of the ABC-MLPNN and PSO-MLPNN methods is higher than the CA of the MLPNN method for both the Selcuk and Boston datasets. It can therefore be clearly seen that the proposed methods are more successful than the conventional-MLPNN method. As a result, it was determined that the success of ANN combined with SI techniques, such as ABC and PSO, was higher than that of the conventional-ANN method. The performances of the proposed methods on different datasets were investigated, and the success of the SI-MLPNN methods was found to be higher than that of the conventional-MLPNN method.

Conclusions and Discussion
Automated detection of normal and ictal EEGs is important in the study of epileptic activities. ANNs have been widely used by researchers to classify EEG signals [56].
The ANN algorithm is one of the most commonly used algorithms to test the accuracy and success of a system's operation. However, BP is usually trapped in local minima, which greatly affects the success and efficiency of the system. SI has been used extensively for training neural networks because of the stochastic nature of the algorithms, which makes them very robust and flexible. Therefore, we investigated the combined ABC-MLPNN and PSO-MLPNN methods to avoid this problem and tested the success of the proposed methods and the conventional-MLPNN method on two different EEG datasets. As mentioned in the experimental results section, the ABC-MLPNN and PSO-MLPNN methods not only achieved satisfactory accuracy but also yielded a suitable eigenvector feature set. We suggest that the proposed methods contribute to the detection of signals indicating brain disorders and that long-term EEG recordings that are difficult to interpret can be used to diagnose epileptic activities using techniques such as embedded systems. The proposed methods can provide valuable contributions to the neurologist in treatment and diagnosis. In addition, if further developed into a user-level program, these methods would allow fast and robust classification of EEGs.