K-Means Clustering Algorithm Based Arrhythmic Heart Beat Detection in ECG Signal
Ö. YAKUT, E. DOĞRU BOLAT and H. EFE

Abstract: Disorders in the functions of the heart cause heart diseases or arrhythmias in the cardiovascular system. Cardiac arrhythmias are diagnosed using the Electrocardiogram (ECG), which measures and records electrophysiological signals. In this study, a three-class, K-means clustering-based arrhythmia detection method is proposed that distinguishes the cardiac arrhythmia types Right Bundle Branch Block (RBBB) and Left Bundle Branch Block (LBBB) from normal heartbeats. Data from the MIT-BIH Arrhythmia Database were used for the clustering-based arrhythmia analysis. Feature Set 1 (FS1) was created by extracting features from the ECG signal using QRS morphology, Heart Rate Variability (HRV) and statistical metrics. The RELIEF feature selection algorithm was used to reduce the dimension of the extracted features, and Feature Set 2 (FS2) was obtained by selecting the most appropriate features in FS1. Overall performance results for FS1 were 99.18% accuracy, 98.78% sensitivity and 99.39% specificity, while overall performance results for FS2 were 95.37% accuracy, 92.99% sensitivity and 96.54% specificity. Using the reduced feature set FS2 lowered the computational cost by reducing the processing complexity and load, while the proposed arrhythmia detection method retained a satisfactory level of performance.


I. INTRODUCTION
The electrocardiogram (ECG) measures and records cardiac activity through electrical signals that are taken from the human body by non-invasive methods. An ECG shows the bioelectric activity of the heart, and specialists use the ECG signal to monitor the normal activity of the heart [1]. Fig. 1 depicts the waveform of a single heartbeat on the ECG recording [2]. The depolarization and repolarization that occur when the heart muscle is stimulated generate the ECG signal and its periodic waves, called the PQRST pattern (see Fig. 1).
Fig. 1. The waveform of the ECG signal [2]
Changes in the time intervals and/or the waveform of the normal heartbeat shown in Fig. 1 may indicate the presence of various cardiac disorders. One of these disorders is arrhythmia, a disorder of the sinus rhythm, which is the normal heart rhythm. Problems arising from the generation or conduction of the electrical excitation that drives heart contraction are collectively called "arrhythmias" [3]. A healthy heart contracts 60-100 times per minute.
In the precordial leads (V1-V2-V3), which show the right side of the heart, an rSR' pattern (with a prolonged S wave in V6) or QRS complexes resembling an M shape are seen, as in Fig. 2 [4]. Fig. 2 shows an example of a Right Bundle Branch Block (RBBB) beat. The left ventricle is normally activated by impulses travelling through the left bundle branch; the conduction then passes to the Atrioventricular Node (AV node), and the right ventricle is activated with a delay. RBBB indicates a problem in the electrical conduction system of the heart. For a diagnosis of RBBB from ECG data, the following signs should be present: the QRS duration should be over 120 ms; the S waves should be widened in leads V5-V6; there should be an M-shaped QRS complex; and there should be T wave inversion and ST depression in V1-V2-V3.
Although RBBB may be present in healthy individuals, its incidence increases with age [4,5].
In Left Bundle Branch Block (LBBB), depicted in Fig. 3, many components of the ECG signal can change. The directions of the QRS and T waves are reversed, as shown in Fig. 3. In leads where the QRS wave is positive, the T wave may be negative. Negative T waves are abnormal in leads with negative QRS. The QRS complexes in the lateral leads may take the form of an M-shaped, notched rS complex, as shown in Fig. 3 [8]. LBBB is very important, since it can mask the diagnosis of myocardial infarction. In LBBB, the depolarization of the ventricles is delayed because the left ventricle is depolarized more slowly, after the right ventricle, owing to slower than normal right-to-left interventricular septal conduction [6]. Typical ECG findings in LBBB include: a QRS duration of 120 ms or more; a large, notched, M-shaped R wave in DI, aVL, V5 and V6; T and Q waves on opposite sides; and no Q waves in V5 and V6 [7].
The literature contains studies on arrhythmia diagnosis using clustering-based or classifier-based algorithms. The study by Akdeniz [9] has two stages. In the first stage, the presence or absence of arrhythmia in the ECG data was determined. In the second stage, the detected arrhythmias were classified. Time-frequency features were derived using various transformation techniques [9]. In the study by Ersoy [3], acceptable classification results were obtained using a multi-layered artificial neural network that detected whether an arrhythmia symptom was present in the ECG signal. Doğan et al. [10] proposed the Vortex Search algorithm, a metaheuristic approach, for fuzzy-based clustering of arrhythmia in the ECG signal. Donoso et al. [11] studied Atrial Fibrillation (AF) clustering based on the ECG signal and identified AF subclasses using K-means and hierarchical clustering algorithms. Suganthy [12] proposed a Fuzzy C-Means (FCM) method that calculated the heart rate from foetal electrocardiogram signals by clustering and grouping the R peaks. Wang et al. [13] proposed a type 2 FCM algorithm to determine early ventricular beats in the ECG signal. In the study by Yücelbaş [14], six different arrhythmia types were identified using hybrid classifiers formed by combining multiple systems. Features were prepared by applying the wavelet transform to data from the Massachusetts Institute of Technology-Beth Israel Hospital (MIT-BIH) database [14]. Zhang et al. [15] extracted wavelet features and classified electronic waveforms using K-means clustering with a Hamming distance. Dallali et al. [16] proposed a classification of arrhythmias using the wavelet transform, Heart Rate Variability (HRV) and FCM clustering. Hilavin et al. [17] suggested a K-Nearest Neighbour (KNN) based classification of five types of arrhythmias using the spectral features of the ECG signal. Yeh et al. [18] proposed a cardiac arrhythmia method for ECG signals that comprised three main stages and used the FCM algorithm.
Mohebbanaaz et al. [19] summarized the pre-processing, feature extraction, feature optimization and classification techniques used in the literature to classify arrhythmias in ECG signals, and compared the methods and their performance. Yeh et al. [20] proposed a method for analyzing the ECG signal to diagnose cardiac arrhythmias using the Cluster Analysis (CA) method. Korürek et al. [21] classified ECG arrhythmias taken from the MIT-BIH Arrhythmia Database using Ant Colony Optimization (ACO) based cluster analysis. Jekova et al. [22] compared the learning capacity and classification abilities of the KNN, Neural Networks (NN), Discriminant Analysis (DA) and Fuzzy Logic (FL) methods for arrhythmia classification. Christov et al. [23] proposed Morphological Descriptors (MD) and Time-frequency Descriptors (TFD) techniques to extract heartbeat characteristics from ECG records for heartbeat classification.
In this study, a novel clustering approach is used to build an algorithm that distinguishes normal, LBBB and RBBB beats in the ECG signal with high performance. Additionally, the computational load is decreased by reducing the size of the feature data set.

A. ECG Signal
ECG signals were obtained from the public MIT-BIH Arrhythmia Database [24]. This database contains 48 ECG recordings, each of which is 30 minutes long. The ECG signal has an amplitude range of ±5 mV and is sampled at 360 Hz. The data set of beat types shown in Table 1 was built using the annotation files of the database, which contain the locations of the R-peaks of the ECG signals. Record 100 was used for normal (N) beats, record 109 for RBBB (R) beats and record 118 for LBBB (L) beats. The number of beats taken from each record is given in Table 1.
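As a minimal sketch of how beats of one type could be picked out of an annotation stream, the Python snippet below filters R-peak locations by annotation symbol. The study itself was implemented in Matlab; the function name `select_beats` and the toy values are hypothetical and are not taken from any record. The symbols N, L and R follow the beat labels used above.

```python
import numpy as np

# Annotation symbols for the three beat types used in the study.
BEAT_SYMBOLS = {"N": "normal", "L": "LBBB", "R": "RBBB"}

def select_beats(r_locations, symbols, wanted):
    """Return the R-peak sample indices whose annotation symbol equals `wanted`."""
    r_locations = np.asarray(r_locations)
    symbols = np.asarray(symbols)
    return r_locations[symbols == wanted]

# Toy annotation stream (hypothetical values, for illustration only).
toy_locations = [77, 370, 662, 946, 1231]
toy_symbols = ["N", "N", "L", "R", "N"]
normal_peaks = select_beats(toy_locations, toy_symbols, "N")
lbbb_peaks = select_beats(toy_locations, toy_symbols, "L")
```

In practice the locations and symbols would come from the annotation file of each record rather than from hand-written lists.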

B. Feature Extraction
In this study, the locations of the R peaks in the ECG signal were obtained from the annotation files. Features carrying meaningful information about the heartbeats were extracted using these R-peaks, and the clustering algorithm used these features to identify the arrhythmia types of the heartbeats.
In the proposed system, a 50-sample window (24 samples before the R peak, the R-peak sample itself and 25 samples after the R peak) was placed around each R peak. The locations of the Q and S points were determined within this window. The amplitudes of the QRS complex components and the width of the QRS complex were used as features.
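The windowing step can be illustrated with a short Python sketch. The text does not state how Q and S are located inside the window, so the snippet assumes that Q is the minimum before the R peak and S is the minimum after it; this rule, the function name `qrs_features` and the synthetic beat are all assumptions made for illustration.

```python
import numpy as np

def qrs_features(ecg, r_index, before=24, after=25):
    """Locate Q and S around one R peak and return simple QRS features.

    Assumption: Q is the lowest sample before R inside the window and
    S is the lowest sample after R inside the window.
    """
    ecg = np.asarray(ecg, dtype=float)
    start = r_index - before
    stop = r_index + after + 1  # 24 + 1 + 25 = 50 samples in total
    if start < 0 or stop > ecg.size:
        raise ValueError("window extends past the signal boundaries")
    q_index = start + int(np.argmin(ecg[start:r_index]))
    s_index = r_index + 1 + int(np.argmin(ecg[r_index + 1:stop]))
    return {
        "q_amplitude": ecg[q_index],
        "r_amplitude": ecg[r_index],
        "s_amplitude": ecg[s_index],
        "qrs_width": s_index - q_index,  # width in samples
    }

# Synthetic beat: a flat baseline with hand-placed Q, R and S samples.
beat = np.zeros(100)
beat[47] = -0.2  # Q
beat[50] = 1.0   # R
beat[53] = -0.4  # S
features = qrs_features(beat, r_index=50)
```

The width is returned in samples; dividing by the 360 Hz sampling frequency would convert it to seconds.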
Using the R-peak locations, the time interval RR(n) between two consecutive R-peaks was calculated. HRV is the variation between these intervals [1]. RR(n) is calculated as in Equation (1) [1].
The feature of the selected beat, RRnorm(n), is calculated as in Equation (2) by normalizing the RR(n) intervals obtained from Equation (1). Here, RRnorm(n) represents the normalized RR(n) interval and n represents the index of the corresponding RR(n) interval. The other features were calculated using the statistical and HRV metrics from the study by Yakut et al. [1]. As a result of the feature extraction process, the feature vector Feature Set 1 (FS1), which contains 15 features, was created.
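The RR computation can be sketched in Python as follows. Two points are assumptions, since the exact forms of Equations (1) and (2) are not recoverable here: RR(n) is taken as the difference between each R-peak location and the previous one, and the normalization divides each interval by the mean interval. The function names and the toy R-peak locations are hypothetical.

```python
import numpy as np

FS = 360  # sampling frequency of the recordings, in Hz

def rr_intervals(r_locations, fs=FS):
    """RR intervals in seconds between consecutive R peaks (assumed form of Eq. 1)."""
    return np.diff(np.asarray(r_locations, dtype=float)) / fs

def rr_normalized(rr):
    """Assumed normalization: divide each interval by the mean interval."""
    rr = np.asarray(rr, dtype=float)
    return rr / rr.mean()

# Toy R-peak locations in samples: three regular beats, then a late beat.
r_peaks = [0, 360, 720, 1080, 1620]
rr = rr_intervals(r_peaks)
rr_norm = rr_normalized(rr)
```

With this normalization, a value above 1 marks an interval longer than the average of the record and a value below 1 marks a shorter one.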

C. Feature Selection
In this study, the RELIEF feature selection algorithm proposed by Kira and Rendell [25] was used. RELIEF is a feature-weighting algorithm inspired by instance-based learning [25]. The algorithm calculates the weights of the features from the input matrix and the output vector and then ranks the features according to these weights. Each feature receives a weight between -1 and +1, and the features are sorted in order of importance [27].
In this study, the RELIEF algorithm was applied to the 15 features, and five of them were selected, in order: RRnorm, RRmedian, QRSwidth, Qamplitude and Samplitude. These five features form the feature vector Feature Set 2 (FS2).
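The weighting idea can be sketched in Python. This is a basic Relief variant written for illustration, not the implementation used in the study: it assumes min-max scaled features, a Manhattan distance, a nearest miss drawn from any other class, and at least two samples per class. The name `relief_weights` and the toy data are hypothetical.

```python
import numpy as np

def relief_weights(X, y, n_iterations=None, seed=0):
    """Basic Relief: reward features that differ on the nearest miss and
    agree on the nearest hit. Every weight lies in [-1, 1].

    Assumes each class contains at least two samples.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    # Scale every feature to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    rng = np.random.default_rng(seed)
    m = n_samples if n_iterations is None else n_iterations
    weights = np.zeros(n_features)
    for i in rng.integers(0, n_samples, size=m):
        dist = np.abs(Xs - Xs[i]).sum(axis=1)  # Manhattan distance to sample i
        dist[i] = np.inf                       # never pick the sample itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class sample
        weights += (np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])) / m
    return weights

# Toy data: feature 0 follows the class label, feature 1 is pure noise.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2], 30)
informative = labels + 0.1 * rng.standard_normal(90)
noise = rng.standard_normal(90)
toy_weights = relief_weights(np.column_stack([informative, noise]), labels)
ranking = np.argsort(toy_weights)[::-1]  # feature indices, most important first
```

Keeping the first five entries of such a ranking would mirror the reduction from 15 features to five described above.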

D. K-Means Clustering
In this study, the K-means clustering algorithm, a well-established unsupervised learning method for solving clustering problems, was utilized. K-means is a simple clustering algorithm that partitions a given input data set into a predefined number of clusters (k clusters) [28].
The K-means clustering algorithm mainly consists of the following steps [28]:
Step 1 - k points are placed in the space represented by the objects being clustered; these points form the initial group centroids.
Step 2 - Each object is assigned to the group that has the closest cluster center, depending on the distance metric used.
Step 3 - Once the objects have been assigned to groups, the positions of the k group centers are recalculated.
Step 4 - Step 2 and Step 3 are repeated until the positions of the group centers remain constant.
When the group centers are fixed, all objects are grouped according to the distance metric used.
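The four steps above can be sketched in Python with NumPy. This is a minimal illustration rather than the Matlab implementation used in the study; the function name `kmeans`, the optional `init` argument and the toy points are assumptions. When `init` is not given, the sketch picks k distinct samples at random as the initial centroids, which is one common choice.

```python
import numpy as np

def kmeans(X, k, init=None, n_iterations=100, seed=0):
    """Plain K-means with the Euclidean distance; returns (labels, centroids)."""
    X = np.asarray(X, dtype=float)
    # Step 1: place the k initial centroids.
    if init is None:
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
    else:
        centroids = np.asarray(init, dtype=float)
    for _ in range(n_iterations):
        # Step 2: assign every object to the closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: three well-separated groups of four points each.
base = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
points = np.vstack([base, base + np.array([10.0, 10.0]), base + np.array([20.0, 0.0])])
labels, centroids = kmeans(points, k=3, init=points[[0, 4, 8]])
```

With random initialization the result can depend on the starting centroids, which is one reason a clustering run is often repeated several times.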

III. EXPERIMENTAL STUDY
In this study, the distance between objects and cluster centers in the K-means clustering algorithm was calculated using the Euclidean distance metric. The proposed system was implemented using Matlab R2015b (The MathWorks, Inc., Natick, Mass., USA) software. The block diagram of the proposed system is shown in Fig. 4. The annotation file containing the locations of the R-peaks of the ECG signal was obtained from the MIT-BIH Arrhythmia Database [24]. Using these R-peak locations, FS1 was generated by extracting the features needed to diagnose arrhythmia. The extracted features were then subjected to a feature selection process using a dimensionality reduction algorithm, and FS2 was created. Both feature data sets were applied to the clustering algorithm. The K-means clustering algorithm was used to distinguish Normal, LBBB and RBBB heartbeats from each other, and the performance of arrhythmic heartbeat detection was assessed.

B. Performance Metrics
In the proposed system, the following metrics were used to measure the success of detecting arrhythmias in the heartbeats of ECG signals [28,29].
Accuracy: The proportion of beats that are correctly identified as belonging or not belonging to a particular cluster. It is calculated as in Equation (3).
Sensitivity: The ability to correctly detect the beats that belong to a particular cluster, i.e. the True Positive (TP) beats. It is calculated as in Equation (4).
Specificity: The ability to correctly reject the beats that do not belong to a particular cluster, i.e. the True Negative (TN) beats. It is calculated as in Equation (5).

IV. EXPERIMENTAL RESULTS
In the proposed system, the heartbeats of the ECG signal were grouped by arrhythmia type using the FS1 and FS2 data sets and the K-means clustering algorithm. The performance results of the proposed system are shown in Table 2 and Table 3. The FS1 overall performance results, given in Table 2, were: accuracy 99.18%; sensitivity 98.78%; and specificity 99.39%. The FS2 overall performance results, shown in Table 3, were: accuracy 95.37%; sensitivity 92.99%; and specificity 96.54%. During the simulation studies, the clustering process was repeated five times to ensure the reliability of the results. For FS2, dimensionality reduction was applied and five of the 15 features were selected.
The graphs in Fig. 5 and Fig. 6 illustrate the RRmedian and RRnorm features for the FS1 and FS2 feature data sets, respectively. They show the cluster centers and the distribution of the heartbeats around these centers. It is observed that the beats are clustered according to the arrhythmia types in the ECG signal. The results in Table 2 and Table 3 indicate that a high-performance system has been developed to detect arrhythmias due to either RBBB or LBBB in the heartbeat records of ECG measurements. However, a comparison of Fig. 5 and Fig. 6 shows that some heartbeats grouped in one cluster in Fig. 5 are grouped under other clusters in Fig. 6. This occurred because the dimensionality reduction left some heartbeats insufficiently represented by the remaining features, which in some situations prevented the K-means clustering algorithm from assigning them correctly.
The FS1 data set consisted of 15 features. The RELIEF feature selection method identified the five most discriminative of these features, which form the FS2 data set. Feature selection therefore reduced FS2 to one third of the size of the original FS1 data set. With a smaller data set, the proposed method consumes fewer system resources and has a lower computational load and complexity. This reduction in the size of the FS2 data set yields satisfactory results in terms of processing load and complexity. However, the performance results are slightly lower than those obtained with the more detailed FS1 data set. The obtained performance results show that the proposed system diagnoses heart arrhythmias at a satisfactory level. Table 4 compares the proposed method with classification methods in the literature [20][21][22][23]. The results in Table 4 show that the FS1 data set has the highest sensitivity value when compared with cluster-based algorithms [20,21]. In addition, the results of the FS1 data set are in the same range as those of previous classification-based algorithms [22,23]. The FS2 data set gives sensitivity values in the same range as those of earlier cluster-based algorithms [20,21], but weaker results than classification-based algorithms [22,23].
After the dimension reduction, the ability of the features in the FS2 data set to distinguish normal, LBBB- and RBBB-type beats decreased. However, the results of the FS2 data set are still satisfactorily high. A comparison of the results of the K-means clustering method in Table 2 and Table 3 shows that the FS1 data set, which contains three times as many features, provides better performance for arrhythmia classification than the FS2 data set.
The performance of the K-means clustering method proposed in the present study has been compared with both clustering and classification methods in the literature. The results indicate that the performance of the proposed method is comparable with that of earlier methods.
It was concluded that the K-means clustering method has a detection capability similar to that of current methods in classifying normal, LBBB- and RBBB-type beats using both the FS1 and FS2 data sets.
In this study, normal, LBBB- and RBBB-type heartbeats were classified using the K-means clustering method. Generally, classification methods are used to identify and classify arrhythmic heartbeats. This study shows that the K-means clustering algorithm is also able to classify arrhythmic heartbeats, and a high-performance cluster-based arrhythmia diagnosis method is proposed.

VI. CONCLUSION
In this study, a method of distinguishing RBBB, LBBB and Normal beats in the ECG signal using the K-means clustering algorithm is proposed. The ECG signals and the locations of the R peaks were obtained from the MIT-BIH Arrhythmia Database. QRS morphology, HRV and statistical metrics were used to extract the features, creating FS1, and the most appropriate features were selected with the RELIEF dimensionality reduction algorithm, creating FS2. The K-means clustering algorithm was applied to the FS1 and FS2 data sets to cluster the arrhythmic beats of the ECG signals. The overall performance results for FS1 were very good and comparable with those of previous classification-based algorithms, while the results obtained for FS2 were somewhat lower than for FS1 but were still acceptable for differentiating normal, RBBB and LBBB heartbeats. In addition, use of the FS2 data set lowered the computational cost by decreasing the processing load and complexity. In future studies, different data sets should be analyzed and the number of arrhythmic heartbeat types should be increased. Morphological, time-domain and frequency-domain features could be extracted from the ECG signal using different methods, and features could be selected using different algorithms. Different clustering methods with high classification performance may then be developed.