PREDICTING CREDIT CARD CUSTOMER CHURN USING SUPPORT VECTOR MACHINE BASED ON BAYESIAN OPTIMIZATION

In this study, we employ a hybrid machine learning algorithm to predict credit card customer churn. The proposed model is a Support Vector Machine (SVM) with Bayesian Optimization (BO). BO is used to optimize the hyper-parameters of the SVM. Four different kernels are utilized, and the hyper-parameters of each kernel are tuned by BO. The predictive power of the proposed models is compared using four different evaluation metrics: accuracy, precision, recall and F1-score. According to every metric, the linear kernel has the highest performance, with an accuracy of 91%. The worst performance is achieved by the sigmoid kernel, which has an accuracy of 84%.


Introduction
Customer churn is a business term that describes the loss of customers. Firms invest in order not to lose their customers. Marketing departments continuously investigate the behavior of their existing and potential customers to understand the underlying causes of churn. These investigations are costly and time consuming. For that reason, in this study we propose a hybrid machine learning algorithm to predict the customer churn of a bank using the available data. We propose a model based on the Support Vector Machine (SVM), which has many applications in regression and classification. We utilize SVM as the classifier in this study because it enables the use of kernel transformations, which project the feature space to a higher dimension and make it easier to find the boundary between the classes. These kernels are non-linear, so SVM can capture complex relations between the observations without complex calculations. Some application areas of SVM are financial bubble detection [1], stock market movement forecasting [2], financial time series forecasting [3], oil price forecasting [4] and air pollution modelling [5].
The SVM has three hyper-parameters. The first one is C, the penalty parameter, which controls the magnitude of the margin of the hyperplane: large values of C imply a small margin, while small values of C imply a large margin. The second is the kernel, which can be linear, radial basis, polynomial or sigmoid. The last one is the γ parameter, which decides the curvature of the decision boundary: a high value indicates more curvature, while a low value indicates less curvature. These parameters cannot be learned by the algorithm itself; they can be defined by the user, or optimization algorithms can be employed to choose them, as in the sketch below. In this study we use Bayesian Optimization to handle the hyper-parameter optimization problem.
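To make these three hyper-parameters concrete, the following is a minimal sketch of how they appear in scikit-learn's SVC. The paper does not state which implementation was used, and the values below are placeholders, not the tuned ones.

```python
from sklearn.svm import SVC

# Illustrative placeholder values; the tuned values are reported in Section 3.
model = SVC(
    C=1.0,          # penalty parameter: large C -> small margin, small C -> large margin
    kernel="rbf",   # one of "linear", "poly", "rbf" (radial basis) or "sigmoid"
    gamma=0.1,      # curvature of the decision boundary: high -> more curvature
)
```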
[1] compares SVM with artificial neural networks (ANN), k-nearest neighbours (KNN), decision trees (DT), random forest (RF) and logistic regression (LR) to predict financial bubbles in the S&P 500 index. Their findings show that SVM is favourable among the others, with almost 95% accuracy. [2] compares the performance of SVM with Linear Discriminant Analysis, Elman Backpropagation Neural Networks and Quadratic Discriminant Analysis to predict the market movements of the NIKKEI 225. Their results show that SVM outperforms the other classifiers. [3] compares SVM with a multi-layer back-propagation (BP) neural network to forecast five futures contracts of the Chicago Mercantile Exchange. The authors show that SVM outperforms BP based on weighted directional symmetry, mean absolute error, directional symmetry and normalized mean square error. [4] investigates the predictive power of SVM for oil price forecasting and compares it with the autoregressive integrated moving average (ARIMA) model and BP. The findings show that SVM outperforms the others. Lastly, [5] uses SVM to predict air pollution in the urban areas of Hong Kong, and the proposed model is compared with ANN. The findings reveal that SVM performs better than ANN. The literature mentioned above provides the necessary evidence of the performance of SVM in both classification and regression. For that reason, we chose SVM as our classifier in this study.
A summary of related works that employ machine learning algorithms to predict customer churn is given in this paragraph. Customer churn prediction based on textual data is studied by [6], where a Convolutional Neural Network (CNN) is proposed as the model. Their data set combines structured information with textual information, and the results show that using textual data as a feature increases the performance of the proposed model. [7] uses the churn rate of customers to predict the electricity sales of the power market. Credit card churn prediction is carried out by [8] using logistic regression and decision-tree-based methods; the comparison shows that logistic regression performs better than the tree algorithms. An extended SVM (E-SVM) and ANN are proposed by [9] to model customer churn in the e-commerce sector. The results show that E-SVM has better performance based on accuracy, coverage rate, hit ratio and lift coefficient. It is also noted that the new algorithm handles the data well when imbalance is an issue. [10] proposes SVM and RF to predict customer churn in the telecom sector, and the results reveal that the investigated learning models behave similarly. Ten different machine learning algorithms are compared by [11] to classify customer churn. The findings of the study indicate that the best performance is achieved by RF and AdaBoost with almost 96% accuracy, and by SVM with 94% accuracy. Some other recent machine learning approaches to customer churn prediction are [12], [13], [14] and [15].
The remainder of this paper is organized as follows. Section 2 is devoted to the methodology. Data and experimental results are given in Section 3, and finally Section 4 concludes the study.

Methodology
2.1. Support Vector Machine. The support vector machine is a supervised machine learning algorithm that can be used for regression or classification. It was introduced by [16]. The main idea behind the algorithm is to find a hyperplane that separates a data set into multiple classes. For instance, if there are two linearly separable classes in a data set, multiple lines can divide the data into two parts. SVM finds the line which maximizes the margin to the closest data points; these data points are called support vectors. In higher-dimensional feature spaces the algorithm uses a hyperplane for classification. If the data set contains classes which are not linearly separable, then the kernel trick is used: the features are transformed to a higher dimension in which they become easier to separate.
Suppose we are given a data set of $n$ observations of $d$ variables, with feature vectors $x_i = (x_{i1}, x_{i2}, \ldots, x_{id}) \in \mathbb{R}^d$ and labels $y_i \in \{-1, 1\}$ for $i = 1, \ldots, n$. Define the linear classifier $f(x) = w^T x + b$, where $w$ is the weight vector and $b$ is the bias term. If the data set is linearly separable, then the hyperplane $w^T x + b = 0$ separates the two classes as

$$w^T x_i + b \geq 1 \quad \text{if } y_i = 1, \qquad w^T x_i + b \leq -1 \quad \text{if } y_i = -1.$$

These two inequalities can be combined into one by multiplying both sides by $y_i$, that is,

$$y_i (w^T x_i + b) \geq 1, \quad i = 1, \ldots, n.$$

The margin between the support vectors and the hyperplane is $\frac{2}{\lVert w \rVert}$. The optimal solution is found by maximizing the margin, that is, by minimizing the length of $w$:

$$\min_{w, b} \ \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \geq 1, \quad i = 1, \ldots, n.$$

The solution of the above optimization problem can be obtained by Lagrange's method as

$$L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right],$$

where $\alpha_i \geq 0$ is the non-negative Lagrange multiplier. The classifier for the linear case can be obtained as

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i y_i x_i^T x + b \right).$$

In the non-linear case the classifier is transformed to

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \right),$$

where $K(\cdot, \cdot)$ is a kernel function. The most commonly used kernels are the linear kernel $K(x_i, x_j) = x_i^T x_j$, the polynomial kernel $K(x_i, x_j) = (\gamma x_i^T x_j + r)^p$, the radial basis kernel $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$ and the sigmoid kernel $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$.
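For reference, the four kernel functions above can be written directly in NumPy. This is a plain sketch of the formulas, with γ, r and the polynomial degree p as illustrative placeholders rather than values from the paper.

```python
import numpy as np

def linear_kernel(xi, xj):
    # K(xi, xj) = xi^T xj
    return xi @ xj

def polynomial_kernel(xi, xj, gamma=1.0, r=0.0, p=3):
    # K(xi, xj) = (gamma * xi^T xj + r)^p
    return (gamma * (xi @ xj) + r) ** p

def rbf_kernel(xi, xj, gamma=1.0):
    # K(xi, xj) = exp(-gamma * ||xi - xj||^2)
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def sigmoid_kernel(xi, xj, gamma=1.0, r=0.0):
    # K(xi, xj) = tanh(gamma * xi^T xj + r)
    return np.tanh(gamma * (xi @ xj) + r)
```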
2.2. Bayesian Optimization. Bayesian optimization is an iterative optimization method which is very popular for the hyper-parameter optimization of machine learning algorithms [17]. It searches for and selects candidate values based on previously evaluated values. It contains two important elements, called the acquisition function and the surrogate model [18]. The observed data points are fitted to an objective function by the surrogate model. The acquisition function determines which points to evaluate next by balancing exploration against exploitation [19]. Exploration is the process of searching the unsampled area, while exploitation is the process of searching the most promising area, in which the global minimum or maximum may occur.

In this paragraph we summarize Bayesian optimization based on the work of [17]. Firstly, the algorithm builds a surrogate model for the objective function. Secondly, using the surrogate model, it determines promising parameter values. Thirdly, the determined values are evaluated on the real objective function. Finally, the surrogate model is updated with the new results. This procedure repeats until the maximum number of iterations is reached. A Gaussian process is a classic example of a surrogate model. This algorithm is more efficient than grid search and random search; for that reason it is employed in this study. A hedged sketch of this tuning loop is given at the end of this section.

2.3. Evaluation Metrics. Precision is the ratio of correctly predicted positive observations to all predicted positive observations,

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}},$$

while recall is defined as

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}.$$

It measures the classifier's ability to identify all of the positive sample points. The $F_1$-score is the weighted average of precision and recall. It can take values between 0 and 1. The performance of the algorithm is at its best when the score is 1 or near 1; in the same manner, it is at its worst when the score is 0 or very near 0. It is calculated by the following formula:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$

Finally, accuracy is the fraction of predictions that the model gets right. It is calculated as the ratio of the sum of the true positives and true negatives to the total number of predictions, that is,

$$\text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}}.$$

It can take values between 0 and 1. If the performance of the model is high, it will take values near 1, otherwise near 0.
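Returning to the Bayesian optimization loop of Section 2.2, it can be sketched with scikit-optimize's BayesSearchCV, which wraps a Gaussian-process surrogate around cross-validated scoring. The paper does not name its BO implementation, and the search ranges and iteration count below are assumptions for illustration only.

```python
from skopt import BayesSearchCV
from skopt.space import Real
from sklearn.svm import SVC

# Bayesian optimization of C and gamma for an RBF-kernel SVM.
search = BayesSearchCV(
    estimator=SVC(kernel="rbf"),
    search_spaces={
        "C": Real(1e-2, 1e2, prior="log-uniform"),      # assumed range
        "gamma": Real(1e-3, 1e1, prior="log-uniform"),  # assumed range
    },
    n_iter=30,            # number of surrogate-model updates (assumed)
    cv=5,
    scoring="accuracy",
    random_state=0,
)
# search.fit(X_train, y_train); search.best_params_ then holds the tuned values.
```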

Data and Analysis
The data set for this study is obtained from Kaggle [20], which is a machine learning and data science community. The data set contains 20 variables, each with 10127 observations and no missing values. The variables and their descriptions are given in Table 1.
The data set contains categorical and numerical variables. For categorical variables which have more than two distinct levels, one-hot encoding is used, because there is no ordinal relation between the levels. Otherwise, the algorithm would assume a natural ordering between the categorical values, which leads to poor performance. A sketch of this step is given below.
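The encoding step can be sketched with pandas as follows; the file and column names are assumptions based on the Kaggle data set description, not verified against the original code.

```python
import pandas as pd

# Load the Kaggle data set (file name assumed).
df = pd.read_csv("BankChurners.csv")

# One-hot encode the multi-level categorical variables, since their levels
# carry no ordinal relation (column names assumed).
categorical_cols = ["Education_Level", "Marital_Status", "Income_Category", "Card_Category"]
df = pd.get_dummies(df, columns=categorical_cols)
```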
According to the data, 16% of the customers leave the bank while 84% stay. The vast majority of the customers are married, and the proportion of females is slightly higher than that of males, by 3%. Mostly, blue credit cards are used, and in general income levels are below $40000. More than 30% of the credit card users have a graduate-level education. The ages of the customers are between 26 and 73. Lastly, the credit card limits are between 1438 and 34516. The correlations between the numerical variables are given in Figure 1. The colour codes of the figure are given on its right-hand side: light red implies strong positive correlation, while dark purple implies negative correlation. There is a high positive correlation between MA and CA, OB and CL, TC4 and TA, and AU and RB, and a high negative correlation between AU and CL and between AU and OB.
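A Figure-1-style correlation heatmap can be reproduced as follows. This is a sketch assuming `df` holds the encoded data set; the colour map is chosen to resemble the described light-red/dark-purple scale rather than taken from the paper.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations of the numerical variables only.
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation between the numerical variables")
plt.show()
```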
The data set is divided into training and test sets. The test set contains 20% of the data, while the rest forms the training set.
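The 80/20 split can be sketched as follows; the target column name, its label values and the use of stratification are assumptions, since the paper does not give these details.

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["Attrition_Flag"])            # target column name assumed
y = (df["Attrition_Flag"] == "Attrited Customer")  # churn encoded as True/False (labels assumed)

# 20% test set; stratification preserves the ~16% churn rate in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```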
We started our analysis with the linear kernel. The best value of C, the penalty parameter, is obtained as 37.5598. On the training set the algorithm with the given parameter has 91% accuracy. The other metrics are given in Table 2. As Table 2 shows, the linear kernel has a weighted average (which calculates the metrics for each label and takes the average weighted by the number of supports) of precision, recall and F1-score of 0.91, while it has an accuracy of 91%.
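The linear-kernel experiment can be sketched as below, plugging in the C value reported above; classification_report prints the per-class and weighted-average precision, recall and F1-score discussed in the text. The data-preparation steps are assumed from the earlier sketches.

```python
from sklearn.svm import SVC
from sklearn.metrics import classification_report

clf = SVC(kernel="linear", C=37.5598)  # C as reported above
clf.fit(X_train, y_train)

# The "weighted avg" row of the report is the quantity discussed for Table 2.
print(classification_report(y_test, clf.predict(X_test)))
```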
Secondly, the polynomial kernel is utilized, and with the help of Bayesian optimization the best value of C is obtained as 0.28860 with γ = 5.3504. The accuracy on the training set with the given parameters is 87%. The other metrics are given in Table 3. As Table 3 shows, the polynomial kernel has a weighted-average precision of 0.77, recall of 0.78 and F1-score of 0.77, while it has an accuracy of 88%. It can be said that the polynomial kernel is worse than the linear kernel according to the calculated metrics.
Thirdly, the radial basis kernel is employed to predict credit card churn. The best value of C is obtained as 11.6085 with γ = 3.2151. The accuracy of the kernel on the training set is 86%. The other metrics are given in Table 4. As Table 4 shows, the radial basis kernel has a weighted-average precision of 0.75, recall of 0.64 and F1-score of 0.67, while it has an accuracy of 86%. It can be said that the radial basis kernel is worse than both the linear kernel and the polynomial kernel according to the calculated metrics.
Lastly, the sigmoid function is used as the kernel. The best parameters for the model are obtained as C = 45.4489 and γ = 6.3796. The model with these parameters has 83% accuracy. The metrics on the test set are given in Table 5. The worst result among the investigated kernels is achieved by the sigmoid function: the algorithm made 2026 predictions and 2021 of them were identified as 1, as the sketch below illustrates. As Table 5 shows, it has an accuracy of 84% while it has very low scores on precision, recall and F1-score.
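The degenerate behaviour of the sigmoid kernel (2021 of 2026 test predictions in a single class) is easiest to see in a confusion matrix; a sketch with the reported parameters follows, again assuming the earlier data-preparation steps.

```python
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

sig = SVC(kernel="sigmoid", C=45.4489, gamma=6.3796)  # parameters as reported above
sig.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes; a near-empty column
# exposes a classifier that collapses onto one class.
print(confusion_matrix(y_test, sig.predict(X_test)))
```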

Conclusion
In this study, we aimed to use a hybrid machine learning algorithm to classify the credit card churn of a bank. It is shown that the best kernel for predicting the churn behaviour of the customers is the linear kernel. Although the data set is complex and contains many explanatory variables, a linear model fits the data better than the non-linear ones. The hyper-parameters of the algorithm are obtained by another algorithm, called Bayesian optimization. Although Bayesian optimization is not the only choice, it is utilized because of its flexibility and speed. In future studies, hyper-parameter optimization tools can be compared, and other machine learning and deep learning algorithms can be utilized to classify the churn behaviour of the customers.
Declaration of Competing Interests. No potential conflict of interest was reported by the author.