Analysis and detection of Titanic survivors using generalized linear models and decision tree algorithm

ABSTRACT


Introduction
The Titanic was a world-famous passenger liner that sank on its maiden voyage in the North Atlantic [1]. There is much speculation in the literature about the legendary Titanic disaster, and research on it is still ongoing [2][3]. Over the years, a dataset containing information about surviving and deceased passengers and crew has been compiled [4]. This dataset is publicly available on Kaggle.com [5].
When the literature is reviewed, it is notable that the Titanic data have been examined for different purposes in recent years. In the study by Barhoom et al., survivors were predicted with artificial neural networks; the algorithm achieved 99.28% accuracy [6]. Singh et al. applied logistic regression, decision trees, decision trees with hyperparameter tuning, k-nearest neighbors, and support vector machines to the Titanic data, obtaining their highest accuracy, 93.6%, with decision trees [7]. Kakde et al., after data cleaning, performed the analysis with logistic regression, decision tree, random forest, and support vector machine methods, and suggested that logistic regression and support vector machines give a good level of accuracy for this classification problem [8]. In another study, Kshirsagar et al. showed that Titanic survivors could be predicted by logistic regression with 95% accuracy [4].
With the development of technology, data collection and storage have become quite easy. As a result, discovering new methods for analyzing data has become more important, and much progress has been made in this area in recent years, especially in data mining: many new algorithms have been introduced and existing algorithms have been improved. As a consequence of these developments, reaching new and different results by analyzing data with different methods has become a goal in itself, as is the case for researchers working on the Titanic data.
In this study, unlike the existing literature, the Titanic data were analyzed using the Random Tree algorithm and generalized linear models. The main purpose of the study is to determine the characteristics of survivors of the Titanic disaster using different methods. To this end, logit and probit regression models, which belong to the family of generalized linear models, were examined in the first stage. At this stage, a significance test was first applied to the data, and only the variables that contributed significantly to the model were included in the analysis. In the second stage, the analysis was carried out with the Random Tree algorithm, a decision tree learning algorithm. To increase the success of the model, the Random Tree classification analysis was repeated with only the variables that contributed significantly to the model. The study was completed by comparing the results.

Methods and Material
In this study, logit and probit models from the family of generalized linear models, and decision trees from among the data mining methods, are discussed.

Titanic Dataset
The dataset contains the variables given in Table 1. However, binary logit and probit analysis showed that some of these variables (sibsp, parch, embarked) did not make a significant contribution to the model, so they were removed from the dataset. The remaining variables were included in the logit and probit models as categorical data. Descriptive statistical analysis of these variables was performed with the SPSS 22.0 package. Binary logit and binary probit analyses were then performed with the Stata 11.0 program, followed by decision tree classification analysis. The decision tree analysis was run both with the remaining variables and with the original version of the dataset. The binary logit and probit regression analyses were carried out by defining indicator variables: pclass1, male, age0 (children), fare0.
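The data preparation described above can be sketched as follows. This is a minimal illustration in Python/pandas, not the SPSS/Stata workflow actually used in the study; the sample rows and the cut-offs for the age0 (child) and fare0 (low-fare) indicators are assumptions for the sketch.

```python
import pandas as pd

# Hypothetical excerpt with the Kaggle column names; the study used the full dataset.
df = pd.DataFrame({
    "pclass": [1, 3, 2, 3, 1],
    "sex":    ["female", "male", "female", "male", "male"],
    "age":    [4, 22, 38, 35, 54],
    "fare":   [81.9, 7.25, 71.3, 8.05, 51.9],
})

# Build the indicator variables named in the text. The age and fare
# cut-offs below are illustrative assumptions, not the paper's values.
df["pclass1"] = (df["pclass"] == 1).astype(int)    # 1st-class indicator
df["male"]    = (df["sex"] == "male").astype(int)  # sex indicator
df["age0"]    = (df["age"] < 18).astype(int)       # child indicator (assumed cut-off)
df["fare0"]   = (df["fare"] < 15).astype(int)      # low-fare indicator (assumed cut-off)

print(df[["pclass1", "male", "age0", "fare0"]].to_dict("list"))
```

Variables such as sibsp, parch, and embarked would simply be dropped from `df` before modeling, mirroring the significance-based elimination described above.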

Generalized Linear Models
Generalized linear models are obtained by extending linear models for cases in which the linear-model assumptions are violated [9]. In many fields, these models are used when the data are categorical or discrete [10]. Generalized linear models consist of a random component, a systematic component, and a link function. The link function determines the name of the model used: if a logit link is used, the model is called a logit regression model [11]. In this study, logit and probit models are discussed.

Logit Regression
If the canonical link used in a generalized linear model is the logit, the model is logit regression [12]. Logistic regression relates independent variables to a categorical dependent variable, whether binary or multinomial. Logit regression makes no assumption of normality or continuity [13]. Therefore, it can be said to be more flexible than linear models.
The logit model is derived from the cumulative logistic distribution function given by Equation 1 [14]:
P_i = 1 / (1 + e^(−(β0 + β1X_i)))   (1)
In this model, X_i provides information about the independent variable, while P_i expresses the probability that individual i makes a particular choice [15]. Thus, P_i also takes values between 0 and 1 [16]. When the probability that an event occurs is divided by the probability that it does not occur, the odds, P_i / (1 − P_i), are obtained [17].
The model becomes linear when the logarithm of the odds is taken. In this case, the model is called logit, and Equation 2 gives the logit link function:
ln(P_i / (1 − P_i)) = β0 + β1X_i   (2)
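The relationship between the logistic probability and the linear log-odds can be sketched in a few lines of Python (a minimal illustration, not the software used in the study):

```python
import math

def logistic(z):
    """Logistic CDF: P = 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Logit link: log-odds ln(p / (1 - p)); the inverse of the logistic CDF."""
    return math.log(p / (1.0 - p))

# The probability stays in (0, 1) for any linear predictor z ...
p = logistic(2.0)
# ... and taking the logit of that probability recovers the linear predictor,
# which is why the model is linear on the log-odds scale.
z = logit(p)
print(round(p, 4), round(z, 4))
```

Here `z` plays the role of β0 + β1X_i: the probability is a nonlinear function of it, but the log-odds are exactly linear in it.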
There are three basic methods in logistic regression analysis; of these, binary logistic regression is used in this study.

Probit Regression
Like the logit model, this model ensures that the probabilities remain between 0 and 1. The probit model assumes that the error term is normally distributed. As a result, the curve of the logit model has wider tails than that of the probit model (Fig. 1). Logit and probit coefficients can be compared using a scaling coefficient proposed by Amemiya [18].

Figure 1. Logit and Probit distributions
When the error distribution is the standard normal cumulative distribution, the probit link function is used and the model is called the probit model [19]. The probit link function is defined by Equation 3:
Φ^(−1)(P_i) = β0 + β1X_i   (3)
Here Φ^(−1) denotes the inverse of the standard normal distribution function, β the coefficient estimates, and X_i the explanatory variables. With u_i denoting the error for each observation, the model in terms of the standard normal cumulative distribution function is given by Equation 4:
P_i = Pr(u_i ≤ β0 + β1X_i) = Φ(β0 + β1X_i)   (4)
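The probit probability, and its closeness to a rescaled logit probability, can be sketched as follows. This is an illustrative computation only; the rescaling factor of roughly 1.6 is the commonly quoted value of the Amemiya-type comparison coefficient mentioned above, used here as an assumption.

```python
import math

def norm_cdf(z):
    """Standard normal CDF Phi(z), computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic(z):
    """Logistic CDF, for comparison with the probit curve."""
    return 1.0 / (1.0 + math.exp(-z))

# Probit probability for a linear predictor z = b0 + b1*x.
z = 0.5
p_probit = norm_cdf(z)

# Scaling the linear predictor by ~1.6 (logit coefficients are roughly
# 1.6 times probit coefficients) makes the two curves nearly coincide,
# which is why the two models usually give parallel results.
p_logit = logistic(1.6 * z)
print(round(p_probit, 4), round(p_logit, 4))
```

The two probabilities differ only in the tails, consistent with the logit curve being wider than the probit curve in Fig. 1.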

Random Tree Algorithm
Due to its many advantages, decision tree learning is often used in data mining studies [20]. In decision tree learning, a tree structure is created. The tree starts from the root node, from which the structure divides into internal nodes. The root node can be considered the feature that best distinguishes the data; it splits into internal nodes after a series of operations applied to the dataset, and each node can itself split into multiple internal nodes. A leaf node is reached by passing through the internal nodes, and the leaf node is where the decision is made. Each transition between nodes depends on a condition, and this condition is determined by the theory on which the chosen algorithm is based [21][22]. Decision trees are very advantageous for reasons such as low computational cost and ease of interpretation. For this reason, as mentioned above, they are preferred in many data mining and especially classification studies [23][24].
Random tree algorithm is a method in which multiple decision trees are created [25]. The algorithm steps are as follows:
• The feature that provides the best classification is selected and the starting node is created.
• A training set is formed from part of the dataset; the remaining data form the test set.
• Trees are created by specifying the number of variables to be used at each node and the number of trees, N; the variables are selected randomly at each node.
• When N trees have been produced, the model is complete and the class of a new member is estimated [25][26].
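The steps above can be sketched as a toy ensemble of randomized one-node trees (stumps). This is only a didactic simplification of the idea, not the Random Tree implementation used in the study: real random-tree learners grow full trees and evaluate several random candidate features per node, whereas here each tree is a single random split and the final class is a majority vote over the N trees.

```python
import random

def train_random_trees(data, labels, n_trees=5, seed=0):
    """Toy sketch: for each of N trees, pick a random feature and a random
    split value at the (single) node, and label each side of the split
    with the majority class it contains."""
    rng = random.Random(seed)
    stumps = []
    n_features = len(data[0])
    for _ in range(n_trees):
        f = rng.randrange(n_features)             # random variable at the node
        t = rng.choice([row[f] for row in data])  # random split value
        left = [y for row, y in zip(data, labels) if row[f] <= t]
        right = [y for row, y in zip(data, labels) if row[f] > t]
        left_lab = max(set(left), key=left.count) if left else labels[0]
        right_lab = max(set(right), key=right.count) if right else labels[0]
        stumps.append((f, t, left_lab, right_lab))
    return stumps

def predict(stumps, row):
    """Majority vote over the N trees' predictions."""
    votes = [(l if row[f] <= t else r) for f, t, l, r in stumps]
    return max(set(votes), key=votes.count)

# Tiny illustrative data: [pclass, sex (0 = female, 1 = male)] -> survived?
X = [[1, 0], [3, 1], [1, 0], [3, 1], [2, 0], [3, 1]]
y = [1, 0, 1, 0, 1, 0]
model = train_random_trees(X, y, n_trees=7)
preds = [predict(model, row) for row in X]
print(preds)
```

In a real application the stumps would be full trees grown on a training split and evaluated on the held-out test split, as described in the steps above.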

Confusion Matrix
The confusion matrix is an analysis tool that shows which observations are classified correctly and which are classified incorrectly: it tabulates, for each actual class in the dataset, the numbers of correct and incorrect predictions made by the classification model. The general form of the confusion matrix is given in Table 2.
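For a binary outcome such as survived/died, the confusion matrix and the classification accuracy derived from it can be computed as follows (a small self-contained sketch with made-up labels, not the study's results):

```python
def confusion_matrix(actual, predicted):
    """2x2 confusion matrix for a binary classifier:
    rows = actual class, columns = predicted class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, fp, fn, tn

actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = survived, 0 = died (illustrative)
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_matrix(actual, predicted)

# Accuracy = correctly classified observations / all observations.
accuracy = (tp + tn) / len(actual)
print(tp, fp, fn, tn, accuracy)
```

The classification accuracies reported later in the paper (e.g. 79.89% for the logit model) are computed from their confusion matrices in exactly this way.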

Descriptive Statistics
When the relationship between the survived and sex variables is examined in Table 3, it is seen that 359 men died and 93 survived, while 64 women died and 195 survived.
Put another way, of the 423 people who died, 359 were men and 64 were women; similarly, of the 288 survivors of the accident, 93 were men and 195 were women. In addition, Table 4 indicates a significant association between the survived and sex variables. When the relationship between the survived and fare variables is examined in Table 5, it is seen that of the 429 people who paid a low fare, 286 died and 143 survived, while of the 282 people who paid a high fare, 137 survived and 145 died. According to Table 6, there is a statistically significant (p=0.00<0.05) relationship between the survived and fare variables.

Logit Regression Results
When Table 7 is examined, it can be said that the estimated model is significant at the 5% level, since p=0.00<0.05.
When the significance values of the variables are examined, all of the variables make a significant contribution to the model (those that did not contribute significantly had already been removed).
The odds ratios are interpreted by reversing them. The comments on the odds ratios are as follows:
• Those in 2nd class are 6.55 times more likely to survive than those in 1st class.
• Those in the 1st class are 4 times more likely to survive than those in the 3rd class.
• Women are 12.80 times more likely to survive than men.
• Children are 4.76 times more likely to survive than the age-1 group.
• Children are 4.54 times more likely to survive than the age-2 group.
• Children are 5 times more likely to survive than the age-3 group.
• Children are 11.11 times more likely to survive than the age-4 group.
• High-fare payers are 1.63 times more likely to survive than low-fare payers.
The probability of survival for a 1st-class, female, child, high-fare passenger is 0.98.
The survival probability of a 3rd-class, male, age-2 group, low-fare passenger is 0.09.
The marginal effect is the change in the dependent variable caused by a small change in an independent variable.
For the logit model given in Table 8, holding the other variables fixed, a 1-unit increase in the age-1 variable decreases the probability of survival by an average of 0.22. This result is in line with the odds ratio results. The classification results for the logit regression model are given in Table 9; the model achieves a classification accuracy of 79.89%.
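The quantities interpreted above (predicted probability, odds ratio, marginal effect) can be sketched for a fitted binary logit model as follows. The coefficients below are hypothetical placeholders, not the fitted values from Tables 7–8; only the formulas are general.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for illustration only (intercept plus the
# indicator variables used in the text; NOT the paper's estimates).
beta = {"const": -1.0, "pclass1": 1.5, "male": -2.4, "age0": 1.2, "fare0": -0.5}

def survival_prob(x):
    """Predicted probability: P = 1 / (1 + e^(-b'x))."""
    z = beta["const"] + sum(beta[k] * v for k, v in x.items())
    return logistic(z)

# Odds ratio for an indicator variable = e^beta: the factor by which
# the odds of survival change when the indicator flips from 0 to 1.
odds_ratio_male = math.exp(beta["male"])

# Marginal effect of variable k at a point x: dP/dx_k = P(1 - P) * beta_k.
x = {"pclass1": 1, "male": 0, "age0": 1, "fare0": 0}
p = survival_prob(x)
marginal_age0 = p * (1 - p) * beta["age0"]
print(round(p, 3), round(odds_ratio_male, 3), round(marginal_age0, 3))
```

With real coefficient estimates in `beta`, the same three formulas reproduce the predicted probabilities, odds-ratio interpretations, and marginal effects reported above.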

Probit Regression Results
According to Table 10, the model is significant since p=0.00<0.05; at least one variable has an effect on the model. The coefficients are also significant, with the exception of the fare1 variable. The classification results for the probit regression model are given in Table 11; the model achieves a classification accuracy of 79.04%. The results obtained with the probit model parallel those obtained with the logit model.

Conclusion
In this study, the estimation of Titanic accident survivors with different methods was investigated. Factors affecting survival were examined, and the survival rate was estimated by classification methods.
In the first stage, logit and probit regression analyses were performed. With these analyses, the variables that contribute significantly to survival were determined, and the classification accuracies were found to be 79.89% and 79.04%, respectively. In the second stage, two different analyses were carried out with the random tree algorithm. In the first, only the variables that made a significant contribution to the model in the logit and probit regressions were used, and the classification accuracy was 81.57%. The second analysis used the variables of the original dataset, and the classification accuracy fell to 77.21%. Considering all the results together, the best approach is to apply decision trees to the variables that contribute significantly to the model.
The study results reveal that, beyond the expected findings, performing decision tree analysis (data mining or machine learning analysis) with variables that contribute significantly to the model yields more successful results. These results emphasize that decision tree learning methods based on new technologies are more successful, but their results can still be enhanced by statistical methods.