Heart attack mortality prediction: an application of machine learning methods

Abstract: The heart is a vital organ in the human body, and acute myocardial infarction (AMI) is the leading cause of death in most countries. Analyzing data related to different health problems and their manifestations can help predict the condition of this organ with a degree of certainty, and researchers have done extensive analytical work to assist doctors in predicting heart problems. The research reported in this paper consists of two main parts. In the first part, we compare different predictive models of hospital mortality for patients with AMI. All results presented in this part are based on real data about 603 patients from a hospital in the Czech Republic and 184 patients from two hospitals in Syria. Although the learned models may be specific to these data, we also draw more general conclusions that we believe hold beyond them. In the second part, because the data is incomplete and imbalanced, we extend the Chow–Liu and tree-augmented naive Bayes (TAN) algorithms to handle such data and compare the quality of these algorithms with others.


Introduction
An enormous amount of data is being generated every day. Analyzing big datasets is impossible without the help of automated procedures. Machine learning [1] provides these procedures. The most commonly used form of machine learning is supervised classification [2]. Its goal is to learn a mapping from the descriptive features of an object to the set of possible classes, given a set of features-class pairs.
Probabilities play a central role in modern machine learning [3]. Probabilistic graphical models (PGMs) [4] have emerged as a general framework for describing and applying probabilistic models. A PGM allows us to efficiently encode a joint distribution over some random variables by making assumptions of conditional independence.
A Bayesian network classifier (BNC) [5] is a Bayesian network applied to the classification task. BNCs have many strengths, including good interpretability, the possibility of including prior knowledge about a domain, and competitive predictive performance. They have been successfully applied in practice, e.g., [6][7][8].
Acute myocardial infarction (AMI) is commonly known as a heart attack. A heart attack occurs when an artery leading to the heart becomes completely blocked and the heart does not get enough blood or oxygen. Without oxygen, cells in that area of the heart die. AMI is among the leading causes of death in most countries worldwide, and its treatment has a significant socioeconomic impact.
One of the main objectives of our research is to design, analyze, and verify a predictive model of hospital mortality based on clinical data about patients. A model that predicts mortality well can be used, for example, for the evaluation of medical care in different hospitals. Evaluation based merely on mortality would not be fair for hospitals where complicated cases are often dealt with. It seems better to measure the quality of health care using the difference between predicted and observed mortality.
A related work was published by Krumholz et al. [9], who analyzed mortality data from USA hospitals using a logistic regression model. In another work [10], the authors designed and verified a predictive model of hospital mortality in ST-elevation myocardial infarction (STEMI). In another work [11], the authors analyzed the medical records of patients suffering from myocardial infarction in a developing country, Syria, and a developed country, the Czech Republic, and presented an idea for handling incomplete and imbalanced data with the tree-augmented naive Bayes (TAN) classifier.

Data
Our dataset contains data from 787 patients from two countries (603 patients from the Czech Republic and 184 from Syria), characterized by 24 variables. The attributes are listed in Table 1. Most records contain missing values, i.e., for most patients only some attribute values are available, and some attributes are not available for the Syrian patients at all; the data is therefore incomplete. The thirty-day mortality is recorded for all patients; 89% of the patients survived, i.e., the data is imbalanced.
In the Czech Republic, the results of blood tests are reported in millimoles per liter of blood. In Syria, some of the measurements are reported in milligrams per liter and some in millimoles per liter. We standardized all measurements to the millimoles-per-liter scale.
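As an illustration, the unit standardization step can be sketched in Python. The conversion factor from mg/L to mmol/L is the analyte's molar mass; the molar masses and the example value below are illustrative assumptions, not figures from the paper.

```python
# Converting a blood-test result from mg/L to mmol/L.
# mmol/L = (mg/L) / (g/mol), since 1 g/mol == 1 mg/mmol.
MOLAR_MASS_G_PER_MOL = {
    "glucose": 180.16,      # g/mol (illustrative)
    "cholesterol": 386.65,  # g/mol (illustrative)
}

def mg_per_l_to_mmol_per_l(value_mg_l, analyte):
    """Convert a concentration from mg/L to mmol/L for a known analyte."""
    return value_mg_l / MOLAR_MASS_G_PER_MOL[analyte]

# Example: 900 mg/L of glucose is roughly 5.0 mmol/L.
print(round(mg_per_l_to_mmol_per_l(900.0, "glucose"), 2))
```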

Machine learning methods
Since the explanatory variables may combine their influence, and the influence of one variable may be mediated by another, it is worth studying the relations among all the variables together. We do this in three steps: (1) since mortality prediction is our primary interest, we compare how well different classifiers predict mortality; (2) to get an overall picture of the relations between all variables, we learn Bayesian network models from the collected data; (3) to handle incomplete and imbalanced data, we show how to extend the Chow-Liu [12] and TAN [5] algorithms to process such data.
We will work with different versions of the data, which vary in how we treat variables that have more than two states: (1) real-valued ordinal variables, (2) discrete-valued variables (with at most five states), and (3) binary variables. We discuss the value transformations in more detail in the next sections.

Ordinal attributes
In our data, we have several categorical variables (sometimes also called nominal variables). These are variables that have two or more categories; for example, sex is a categorical variable with two categories (male and female). However, some machine learning methods require ordinal attributes, i.e., attributes whose values have a natural ordering with respect to their impact on the class. This is satisfied by all attributes that can take only two values, even if they are nominal, e.g., sex (0 for male, 1 for female) and mortality (0 for survived, 1 for died). In our data, ordinality can be assumed for most real-valued attributes, but note that there may also exist laboratory tests whose deviation from the normal range in either direction (i.e., both lower and higher values) increases mortality. We will refer to the ordinal data as D.ORD.

Discrete attributes
A discrete variable is a variable that can take values from a finite set. Some classification methods require discrete variables. To get statistically reliable estimates of model parameters, it is advisable to keep the number of values as low as possible while still being able to express the significant relations. We therefore discretized all real-valued attributes. It is not easy to find the optimal number and placement of split points for discretization.
Fortunately, the Czech National Code Book classifies numeric laboratory results, with respect to age and sex, into nine groups 1, 2, ..., 9, where group 5 corresponds to standard values in the standard population. We further reduced the number of states to five by joining some groups together. We will refer to data in this form as D.DISCR.
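The group-then-merge scheme can be sketched as follows. The cut points and the merging map below are hypothetical placeholders; the actual boundaries come from the Czech National Code Book and depend on age and sex.

```python
import numpy as np

# Hypothetical cut points: 8 cuts define 9 groups (group 5 = normal range).
CUT_POINTS = [1.0, 2.0, 3.0, 3.5, 4.5, 5.0, 6.0, 7.0]

# Illustrative merging of the nine groups into five states,
# with state 3 corresponding to group 5 (the normal range).
GROUP_TO_STATE = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 4, 7: 4, 8: 5, 9: 5}

def discretize(value):
    """Map a real-valued laboratory result to one of five states."""
    group = int(np.digitize(value, CUT_POINTS)) + 1  # groups 1..9
    return GROUP_TO_STATE[group]

print(discretize(4.0))  # falls into group 5 -> state 3
```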

Binary attributes
Binary data are data whose variables can take only two possible states, traditionally denoted 0 and 1 in accordance with the binary numeral system and Boolean algebra. In our case, each laboratory test is encoded using two binary attributes: the first takes the value 0 for standard test values and 1 if the values are decreased; the second takes the value 0 for standard values and 1 if the values are increased. The age, height, and weight attributes are removed. From the demographic group of attributes, only sex and body mass index (BMI) were kept, with BMI encoded using two binary attributes, BMI high and BMI low, where a BMI greater than the mean takes the value 1 and 0 otherwise. We will refer to data in this form as D.BIN.
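The two-binary-attribute encoding of a laboratory test can be sketched as follows; the reference range used here is a hypothetical placeholder for illustration.

```python
def encode_lab_test(value, low=4.0, high=5.5):
    """Return (decreased, increased) flags for one test result.

    Both flags are 0 for a value inside the reference range; exactly one
    flag is 1 when the value is below (decreased) or above (increased) it.
    """
    decreased = 1 if value < low else 0
    increased = 1 if value > high else 0
    return decreased, increased

print(encode_lab_test(3.2))  # (1, 0): decreased
print(encode_lab_test(4.8))  # (0, 0): within range
print(encode_lab_test(6.1))  # (0, 1): increased
```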

Attribute selection
Before learning a model, we preprocess the data. One of the most useful parts of preprocessing is usually attribute selection, in which irrelevant attributes are removed. Attribute selection is a process by which we automatically search for the best subset of attributes in our dataset. The notion of "best" is relative to the problem we are trying to solve, but it typically means the highest accuracy. Three key benefits of performing attribute selection on our data are:
• It reduces overfitting. Less redundant data means a lower chance of making decisions based on noise.
• It improves accuracy. Less misleading data means that modeling accuracy improves.
• It reduces training time. Less data means that algorithms train faster.
The CfsSubsetEval method of Weka [13] selects subsets of attributes that are highly correlated with the class while having low intercorrelation. We searched the space of all subsets by a greedy best-first search with backtracking. Data D after the application of this attribute selection method will be suffixed as D.AS.
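The idea behind CFS-style selection can be sketched in Python: score a subset by Hall's merit formula (high mean attribute-class correlation, low mean intercorrelation) and grow it greedily. This is a simplified stand-in for Weka's CfsSubsetEval with best-first search, not its actual implementation; the demo data is synthetic.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS-style merit of a subset: k*r_cf / sqrt(k + k(k-1)*r_ff),
    using absolute Pearson correlations."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                    for i in subset for j in subset if i < j])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y):
    """Greedy forward search: add the attribute that raises merit most,
    stop when no addition improves it."""
    remaining, selected, best = set(range(X.shape[1])), [], -np.inf
    while remaining:
        merit, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if merit <= best:
            break
        best, selected = merit, selected + [j]
        remaining.remove(j)
    return selected

# Synthetic demo: x0 drives the class, x1 is a redundant noisy copy of x0,
# x2 is irrelevant noise; the search should keep only attribute 0.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
y = (x0 > 0).astype(float)
x1 = x0 + rng.normal(size=500)
x2 = rng.normal(size=500)
X = np.column_stack([x0, x1, x2])
print(greedy_cfs(X, y))
```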

Tested classifiers
For the tests, we used a large subset of the classifiers implemented in Weka. Classifiers that performed best in preliminary tests qualified for the final tests. In the final tests we compared the following classifiers:
• Decision tree C4.5 [14].
• Naive Bayes (NB) classifier [16], which assumes that the value of a particular explanatory variable (attribute) is independent of the value of any other attribute given the class variable.
All BN algorithms implemented in Weka assume that all variables are discrete with finitely many states; where a method could not be applied, we report NA in the results.
We use leave-one-out cross-validation as the model evaluation method: N separate times, the classifier is trained on all the data except one record, and a prediction is made for that record. The average error over all N records is then computed and used to evaluate the model.
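The evaluation loop can be sketched as follows, with a minimal nearest-class-mean classifier standing in for the Weka classifiers used in the paper; the data here is synthetic and purely illustrative.

```python
import numpy as np

def nearest_mean_predict(X_train, y_train, x):
    """Predict the class whose training-set mean vector is closest to x."""
    candidates = []
    for c in np.unique(y_train):
        mu = X_train[y_train == c].mean(axis=0)
        candidates.append((np.linalg.norm(x - mu), c))
    return min(candidates)[1]

def leave_one_out_error(X, y):
    """Train N times on N - 1 records, predict the held-out record,
    and return the average error."""
    n = len(y)
    errors = 0
    for i in range(n):
        mask = np.arange(n) != i
        errors += int(nearest_mean_predict(X[mask], y[mask], X[i]) != y[i])
    return errors / n

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)  # class determined by the first feature
print(leave_one_out_error(X, y))
```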

Prediction quality
For each data record classified by a classifier, there are four possible classification results. Either the classifier labels a positive example as positive (in our data, the positive example is a patient who did not survive), a true positive (TP), or it makes a mistake and marks it as negative, a false negative (FN). Conversely, a negative example may be mislabeled as positive, a false positive (FP), or correctly marked as negative, a true negative (TN). These four counts define the metrics we report: accuracy, precision (TP / (TP + FP)), recall (TP / (TP + FN)), the F-measure (the harmonic mean of precision and recall), and the ROC curve with its area under the curve (AUC). The ROC curve plots the true positive rate against the false positive rate; in other words, it shows how many correct positive classifications can be gained as we allow more and more false positives. As an example, Figure 1 shows the ROC curve for the naive Bayes classifier with the ordinal attributes; its area under the curve is 0.782.
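The threshold-based metrics above can be computed directly from the four confusion-matrix counts; the counts in the example are illustrative only, not taken from the paper's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F-measure from confusion counts.
    'Positive' here is the minority class (the patient did not survive)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure

# Illustrative counts only:
acc, prec, rec, f1 = classification_metrics(tp=10, fp=5, tn=70, fn=15)
print(acc, prec, rec, f1)
```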

Results of experiments
In Table 2, we compare the results of different classifiers on different versions of the data. The C4.5 classifier with D.DISCR has the highest accuracy, 0.942, and its recall and precision are also among the best achieved. However, its area under the ROC curve is very low, only 0.371, which suggests that this classifier cannot be satisfactorily tuned if we want to trade precision for recall or vice versa.
The attribute selection method (the CfsSubsetEval method of Weka) contributed noticeably to model performance: accuracy improved in general, except for C4.5 with D.ORD and LOG.REG with D.ORD and D.BIN. The AUC and F-measure were also improved for most of the models. Nevertheless, the precision, recall, and F-measure values of almost all methods are very low because of the imbalanced data, since we predict the minority class of patients who will not survive.
In Figure 2, we present the tree structure of the C4.5 classifier learned from the discrete data. It achieved the highest accuracy of all tested classifiers, and its structure is surprisingly simple. If the patient is Czech, then the patient is predicted to survive. If the patient is Syrian, then the LDL cholesterol value is checked: below 4.78, the patient is predicted to survive; between 4.78 and 6.28, the prediction depends on the Syrian hospital in which the patient is treated (patients in the public hospital, SYR1, are predicted to die, and patients in the private one, SYR2, to survive); above 6.28, the patient is predicted to die regardless of the hospital. The simplicity of the C4.5 classifier is in line with the general recommendation that, to avoid overfitting the training data, models should be as simple as possible. This is probably the best that can be learned from these data, but it most likely oversimplifies reality; more data would be needed.
The highest AUC was achieved by the naive Bayes classifier with the ordinal attributes. The highest F-measure was achieved by BN.K2 with discrete attributes selected by the CfsSubsetEval method of Weka [13]; the learned BN model is actually also a naive Bayes model, see Figure 3. We can conclude that there is no single winner, i.e., no classifier that is best in terms of all considered criteria. Moreover, the classifiers differ in which variables they consider important for AMI mortality prediction.

Dealing with incomplete and imbalanced data
As we can see from Section 2, our dataset contains incomplete and imbalanced data. In [11] we presented an idea for extending TAN [5] to handle incomplete and imbalanced data (Algorithms 1 and 2), where the conditional mutual information (CMI) is defined as

I(X, Y | Z) = Σ_{x,y,z} f(x, y, z) log [ f(x, y, z) f(z) / ( f(x, z) f(y, z) ) ],

where f denotes the frequencies estimated from the data and the sum is only over the values x, y, z such that f(x, z) > 0 and f(y, z) > 0.

Algorithm 1 TAN for incomplete data
1: Procedure CMI: compute I_p = I(A_i, A_j | C) from the data D; return I_p; Endprocedure.
2: Compute I_p = I(A_i, A_j | C) between each pair of attributes, i ≠ j, using the Procedure CMI.
3: Build a complete undirected graph in which the vertices are the attributes A_1, A_2, ..., A_n. Annotate the weight of the edge connecting A_i to A_j by I_p = I(A_i, A_j | C).
4: Build a maximum weighted spanning tree.
5: Transform the resulting undirected tree into a directed one by choosing a root variable and setting the direction of all edges to be outward from it.
6: Construct a TAN model by adding a vertex labeled by C and adding edges from C to all other vertices in the graph.

In a similar way, we can create a procedure that enables the Chow-Liu algorithm to deal with incomplete data, whereas the standard Chow-Liu algorithm [12] handles only complete data. The procedure is shown in Algorithm 3, where the mutual information (MI) is defined as

I(X, Y) = Σ_{x,y} f(x, y) log [ f(x, y) / ( f(x) f(y) ) ],

where the sum is only over the values x, y such that f(x) > 0 and f(y) > 0.

Algorithm 3 Chow-Liu for incomplete data
1: Procedure MI: compute I_p = I(X, Y) from the data D; return I_p; Endprocedure.
2: Compute I_p = I(A_i, A_j) between each pair of attributes, i ≠ j, using the Procedure MI.
3: Build a complete undirected graph in which the vertices are the attributes A_1, A_2, ..., A_n. Annotate the weight of the edge connecting A_i to A_j by I_p = I(A_i, A_j).
4: Build a maximum weighted spanning tree.
5: Transform the resulting undirected tree into a directed one by choosing a root variable and setting the direction of all edges to be outward from it.
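The core idea of the Chow-Liu variant for incomplete data, estimating each pairwise mutual information from all records in which both attributes are observed (rather than discarding every incomplete record) and then building a maximum weighted spanning tree, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; missing values are represented by None, and the toy dataset is made up.

```python
import math
from itertools import combinations

def mutual_information(records, a, b):
    """MI between attributes a and b, estimated from the records
    in which BOTH values are observed."""
    pairs = [(r[a], r[b]) for r in records
             if r[a] is not None and r[b] is not None]
    n = len(pairs)
    f_xy, f_x, f_y = {}, {}, {}
    for x, y in pairs:
        f_xy[(x, y)] = f_xy.get((x, y), 0) + 1
        f_x[x] = f_x.get(x, 0) + 1
        f_y[y] = f_y.get(y, 0) + 1
    mi = 0.0
    for (x, y), c in f_xy.items():  # sum only over observed combinations
        p_xy, p_x, p_y = c / n, f_x[x] / n, f_y[y] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

def max_spanning_tree(attrs, records):
    """Kruskal's algorithm on the complete graph weighted by pairwise MI."""
    edges = sorted(((mutual_information(records, a, b), a, b)
                    for a, b in combinations(attrs, 2)), reverse=True)
    parent = {a: a for a in attrs}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    tree = []
    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((a, b))
    return tree

# Toy incomplete dataset: A and B are strongly dependent, C is noise.
records = [
    {"A": 0, "B": 0, "C": 1}, {"A": 0, "B": 0, "C": 0},
    {"A": 1, "B": 1, "C": None}, {"A": 1, "B": 1, "C": 0},
    {"A": 0, "B": None, "C": 1}, {"A": 1, "B": 1, "C": 1},
]
print(max_spanning_tree(["A", "B", "C"], records))
```

Directing the tree from a chosen root (and, for TAN, adding the class vertex with edges to all attributes) completes the construction.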
The idea behind Algorithms 1 and 3 is that using more of the available records yields more reliable estimates of the mutual information and conditional mutual information.

Results
We will refer to the versions of TAN and Chow-Liu that deal with incomplete and imbalanced data as TANI and CLI, respectively. We used 10-fold cross-validation to compare how the results change; the results are summarized in Table 3. We compare the results of our methods with those of TAN in bnclassify 1 (referred to as TB); Chow-Liu [12] (CL); the EM algorithm [19] for Chow-Liu using Hugin 2 (EMCL); the standard TAN [5]; the algorithm of [20], an adaptation of TAN learning based on the EM principle that learns a tree-augmented naive Bayes classifier from incomplete data in which any variable can have missing values (FL); and the SMOTE algorithm [21] applied to TAN (ST), on two versions of the dataset (binary and discrete attributes). As measures of prediction quality, we use the log-likelihood (LL) and the AUC, again with 10-fold cross-validation as the model evaluation method. Algorithm TANI with D.BIN achieved the highest AUC (ROC = 0.953) and the highest LL (−2744.4279). The results of Algorithm 1 are better than those of the standard TAN algorithm on both D.DISCR and D.BIN. ST achieved the second-highest LL with D.DISCR (LL = −6043.0785) with an AUC of 0.802; its AUC is nevertheless better than the AUCs of Algorithm 1 with D.DISCR and of Algorithm 3 with both datasets. We can conclude that TANI is the single winner with D.BIN.

Quality of classifiers tested on artificial data
The data we have is not big enough to yield very reliable results. Since TAN [5] is a reliable model that has been tested on many datasets, we decided to use the BN.TAN model [5], whose results are presented in Table 2, to generate a sequence of datasets of sizes 3000, 5000, 7000, and 10,000, with 10% of values missing completely at random and 26 attributes including the class, under two different types of probability distribution (a basic probability distribution and a binary distribution), to test the algorithms (Algorithm 1, TANI, and FL [20]); see Figures 4 and 5. We can see that our Algorithm 1 is better than the others, and that TANI does not seem to perform well on the big binary datasets.
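The missing-completely-at-random (MCAR) masking step of the synthetic-data setup can be sketched as follows. The generating distribution here is a simple placeholder, not the learned BN.TAN model used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_records, n_attrs = 3000, 26

# Placeholder generating distribution: independent binary attributes.
data = rng.integers(0, 2, size=(n_records, n_attrs)).astype(float)

# MCAR masking: each cell is hidden independently with probability 0.1.
mask = rng.random(size=data.shape) < 0.10
data[mask] = np.nan

print(round(np.isnan(data).mean(), 3))  # observed missing rate, close to 0.10
```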

Conclusion
We used medical data on patients with AMI to compare the results of (a) classification models and (b) Bayesian networks modeling the relations found in the data. Although the conclusions might seem specific to the data used here, we also report general observations. In principle, the BN learning algorithms are able to discover mediated correlations, since they test not only pairwise independence but also conditional independence given the values of other variables.
Bayesian networks are a tool of choice for reasoning under uncertainty, even with incomplete data. However, Bayesian network structure learning often handles only complete data. We have proposed here an adaptation of the Chow-Liu and TAN learning processes to incomplete and imbalanced datasets. These methods have been tested successfully on our dataset, and we have seen that the TANI algorithm is the single winner with D.BIN.