A postpruning decision algorithm based on loss minimization

In this paper, a post-pruning method known as zero-one loss function pruning (ZOLFP), which is based on the zero-one loss function, is introduced. The proposed ZOLFP method minimizes the expected loss, rather than evaluating the misclassification error rate of a node and its subtree. The subtree is pruned when the expected loss of the node is less than or equal to the sum of the losses of its leaves. The experimental results demonstrate that the ZOLFP method outperforms the Un-pruned C4.5 Decision Tree (UDT-C4.5) algorithm, reduced error pruning (REP), and minimum error pruning (MEP) in terms of accuracy on all datasets used. It is also shown that the complexity of the proposed ZOLFP method is not more than the complexity of the REP and MEP methods. Furthermore, the results show that the ZOLFP method achieves satisfactory results compared to the REP, MEP, and UDT-C4.5 algorithms in terms of precision, recall, true positive rate, false positive rate, F-measure, and area under ROC.


Introduction
Due to rapid progress in information technology, a huge amount of data is generated. Different approaches have been introduced to analyze and handle this data for future decision making. Data mining is one of the newest powerful technologies used to extract valuable and meaningful knowledge from large databases. It is also known as the knowledge discovery process or data pattern examination [1][2][3]. Several data mining techniques have been developed and used for extracting knowledge from databases [4]. A decision tree, which has a flowchart-like structure, is one of the effective data mining techniques and is widely used due to its high accuracy [5]. Induction is performed top-down, and attribute selection measures such as information gain, gain ratio, and Gini index are used to select attributes. During the induction process, a decision tree may grow into an unwanted and meaningless tree that is large and complex because of the overfitting problem. Therefore, pruning methods are introduced to combat overfitting, which affects the efficiency and accuracy of the tree. Two types of pruning are used: postpruning and prepruning. In postpruning, the tree is first grown and then unsuitable branches are removed. Hence, a postpruning method produces the complete tree and then adjusts it in order to improve the classification accuracy on unseen instances. In prepruning, overfitting is checked during the tree building process. There are two techniques for prepruning: minimum number of object pruning and chi-square pruning. Postpruning techniques include reduced error pruning (REP), error complexity pruning, minimum error pruning (MEP), and cost-based pruning. Among these techniques, REP produces smaller trees with better accuracy. REP traverses the decision tree in postorder, checking all internal nodes, and replaces a subtree with a leaf node when the misclassification error rate of the subtree is no smaller than that of the leaf.
In this method, as in some other postpruning methods, the dataset is divided into three sets: the training set that is used for training, the validation set that is used for pruning the tree, and the test set that is used for an unbiased estimation over future unseen instances [6]. One of the advantages of this method is its linear computational complexity, but when the test set is much smaller than the training set, it tends to overprune.
The aim of most pruning algorithms is to minimize the expected error rate of the decision tree. Recently, some postpruning methods have been introduced to prune the tree with respect to the loss matrix [7][8][9]. In such cases, it may be desirable to prune the tree by estimating the expected loss instead of estimating the misclassification error rate. Pruning for loss minimization is another point of view in pruning that may result in different pruning behavior than pruning for error minimization. This type of pruning relies on the probability distribution to adjust the prediction to minimize the expected loss or to supply a confidence level associated with the prediction [10]. While some researchers employed statistical methods to tackle the overfitting problem [11], others adopted the idea of incorporating the general loss matrix into the C4.5 decision tree algorithm to reduce the misclassification cost or to minimize the misclassification loss. These algorithms developed heuristic splitting methods that perform dual tasks. The first task is to select the split that minimizes the misclassification cost and the second task is to select a proper split for subsequent splits.
In this paper, a new decision tree postpruning method known as zero-one loss function pruning (ZOLFP), which is based on the zero-one loss function, is introduced for loss minimization. Instead of estimating the misclassification error rate, the loss (or cost) of misclassification is estimated for each node and its subtree. If the loss of the node is less than or equal to the sum of the losses for the whole subtree under that node, the subtree is pruned. Several experiments are conducted to investigate the effectiveness of the proposed ZOLFP method, and its results are compared with those of the REP and MEP methods and the Un-pruned C4.5 Decision Tree (UDT-C4.5).
The paper is organized as follows. Section 1 presents the introduction and research motivation, whereas, in Section 2, the literature review and related works are discussed. The proposed loss minimization method based on the ZOLFP algorithm is studied with a running example in Section 3. In Section 4, the experimental results of the proposed method are presented and compared with some existing approaches. Finally, the conclusion of the paper and future work are explained in Section 5.

Related Works
Over the past decades, different postpruning techniques have been introduced by researchers to investigate the overfitting problem. In [4], the researchers performed a comparative study to investigate and analyze six well-known postpruning techniques, namely, REP, pessimistic error pruning, MEP, critical value pruning, cost complexity pruning, and error-based pruning. The researchers highlighted the theoretical weakness and strengths of each method. Their results showed that REP produced the smallest subtree with the lowest error rate with respect to the pruning set. However, in [12], the researchers introduced a new decision tree algorithm based on J48 and REP. The new method was compared to original J48 decision tree and the results showed that their method produced a smaller tree and produced better performance accuracy. In [13], the researchers implemented decision tree induction algorithm with REP technique to improve the performance accuracy. Their proposed method generated an optimal decision tree with less complexity and better performance accuracy.
The research in [14] addressed the problem of multilabel classification where each example can belong to more than one class at the same time. The researchers introduced a new pruning technique known as PruDent. In terms of performance accuracy, PruDent was more accurate when compared with other state-of-the-art approaches and its computational costs were linear. On the other hand, one of the drawbacks of this method was the reliance on confidence scores. Furthermore, the authors in [15] adopted ability, stability, and scale as new classification standards for classification evaluation. Results showed that the improved method solved problems better than single standard evaluation methods and reflected more advantages. Thus, this improved method produced a more balanced classification performance and less model complexity. The researchers in [16] introduced a new decision tree algorithm for numeric data, based on C4.5, that was named competition cost-sensitive C4.5. They designed a heuristic function that was based on the test cost and the information gain ratio. The results showed that the proposed postpruning algorithm was more effective and stable, and it produced a decision tree with a lower cost.
In [17], the researchers developed a postpruning decision tree algorithm based on Bayesian theory, where the branches generated by the C4.5 algorithm were validated by the Bayesian theorem. This method produced a small decision tree since the branches outside the condition range were removed. The results showed that the proposed method produced a simpler decision tree compared to the original C4.5 algorithm.
In [6], various measurement techniques were utilized to evaluate the performance of postpruning methods like accuracy, stability, and simplicity. A multiobjective evaluation was proposed to select the best subtree during the postpruning process. Moreover, the researchers developed a procedure for obtaining the optimal subtree based on user-provided preference and value function information. The researchers in [18] conducted a comparison study for REP method in decision tree. The performance of REP was analyzed and they deduced that if the algorithm produced a simple tree with low accuracy it meant that the algorithm computed high misclassification during instance learning process. The results showed that J48 and REP produced a tree with high accuracy of classification with less complexity. The researchers in [19] introduced a new pruning method that aimed to improve both classification accuracy and tree size. The newly introduced method was first applied to decision tree by implementing prepruning at the inducing phase of the tree, and postpruning after the inducing process. The researchers in [20] introduced a new decision tree pruning method based on backpropagation neural networks, called soft-pruning. Firstly, C4.5 method was employed for decision tree induction and then the obtained trees were pruned by using the soft-pruning method. The experiment results indicated that the soft-pruning method performed better than original C4.5 pruned and unpruned trees.
The researchers in [21] proposed a new C5.0 classifier method that employed feature selection, cross validation, REP, and model complexity for the original C5.0 method to reduce the tree size and improve the accuracy. The experimental results showed that when feature selection was applied, the attribute space of the feature set was reduced, while applying the cross validation technique provided a more reliable predictive estimate. On the other hand, when the model complexity was increased, the accuracy of the classification was also increased, and whenever the REP technique was applied, the overfitting rate of the decision tree was reduced. The accuracy of the proposed method was improved compared to the original C5.0 method. A new pruning method called multilevel pruned classifier that integrated the pruning phase into the building phase was developed in [22]. The experiments were conducted on the dataset with and without pruning in terms of complexity and classification accuracy.
The researchers in [23] proposed a decision tree method that adopted rough set theory for pruning. The proposed method introduced a depth-fitting ratio which involved both the depth and the explicit degrees of the subtrees under evaluation. The experimental results demonstrated that the newly proposed method was feasible and effective for pruning. The sizes of the constructed decision trees were considerably reduced, while the prediction accuracy was improved. Furthermore, in [24], a multistrategy pruning algorithm was introduced to trim the tree. During the pruning process, three groups of strategies were implemented to get the optimal solution, namely, simple size, degree of matching, and scale of the tree. The experimental results illustrated that the multistrategy pruning algorithms for decision tree pruning improved the efficiency and accuracy of the intrusion detection system. The transactions were classified into four risk levels instead of being classified as either fraud or nonfraud, such that the proposed method showed promising progress.
In [25], the researchers introduced a decision tree pruning method based on genetic algorithms. The experimental results indicated that applying genetic algorithms in decision tree pruning was effective and feasible, and the pruning operation was converted to optimizing the edge weights. The new approach was compared with some decision tree pruning techniques including cost-complexity pruning, pessimistic error pruning, and REP. The results showed that the new method performed as well as or better than the other pruning methods. A unifying framework based on the four-tuple (space, operators, evaluation function, and search strategy) was introduced in [26]. The pruning methods were investigated by means of this framework and their common aspects, strengths, and weaknesses were described. The experiments were conducted by utilizing six well-known pruning methods, and a comparison was performed between them. The results demonstrated that pruning methods did not reduce the productivity, but they could enhance the final tree accuracy.

Proposed zero-one loss function pruning algorithm based on loss minimization
In the past decade, many algorithms have been introduced to minimize the misclassification error rate, while some studies have attempted to minimize the expected loss or the cost of misclassifying an example in the dataset. Although recent research has continued to develop more accurate algorithms to reduce the misclassification error rate, applications in business, medicine, and science have shown that real problems require more subtle measures of performance [27][28][29][30]. Furthermore, in general, various kinds of errors have different costs. The performance of classifiers is evaluated by estimating the error rate obtained from the dataset, which represents the incorrectly classified examples. It is sometimes suitable to evaluate classifiers by considering the cost of the misclassification errors based on a loss matrix.
Several researchers have adopted the idea of incorporating the general loss matrix into the C4.5 decision tree algorithm to reduce the misclassification cost or to minimize the misclassification loss [31][32][33][34]. These algorithms introduced heuristic splitting methods that perform dual tasks. First, they specify the split that minimizes the loss and, second, they select the proper split for subsequent splits. Normally, the loss matrices are $T_c \times T_c$ matrices, where $T_c$ is the number of classes, the rows represent the classes predicted by the classifier algorithm, and the columns represent the correct classes. The loss function $L(i \mid j)$ gives the "loss" of predicting class $i$ when the true class is $j$, and the diagonal elements, $L(i \mid i)$, are always zero [35]. In general, loss functions state exactly how much each action costs; for instance, the loss function $L(a_i \mid C_j)$ indicates the loss incurred for taking action $a_i$ when the class is $C_j$.
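As a rough illustration of the loss-matrix convention described above (rows are predicted classes, columns are true classes, zero diagonal), the following Python sketch builds a zero-one loss matrix and computes the expected loss of each possible prediction. The function names and the class probabilities are hypothetical and chosen only for this example.

```python
def zero_one_loss_matrix(num_classes):
    """Return L where L[i][j] is the loss of predicting class i when the true class is j."""
    return [[0 if i == j else 1 for j in range(num_classes)]
            for i in range(num_classes)]


def expected_losses(loss, probabilities):
    """Expected loss of each prediction i: sum over j of L(i | j) * P(C_j | x)."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, probabilities))
            for row in loss]


loss = zero_one_loss_matrix(3)
probabilities = [0.7, 0.2, 0.1]  # hypothetical P(C_j | x) values, not from the paper
risks = expected_losses(loss, probabilities)

# The prediction with the smallest expected loss is the most probable class.
best_class = min(range(len(risks)), key=risks.__getitem__)
```

Under the zero-one loss every error costs the same, so the expected loss of predicting class $i$ reduces to one minus the probability of class $i$. A general loss matrix with unequal off-diagonal entries could be passed to `expected_losses` in the same way.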
In this paper, a new decision tree postpruning method that is based on loss minimization is introduced. This method employs the zero-one loss function in the C4.5 decision tree to avoid the overfitting problem and improve the performance. The expected loss, also called the conditional risk, of the zero-one loss function is defined by

$$R(a_i \mid x) = \sum_{j=1}^{T_c} L(a_i \mid C_j)\, P(C_j \mid x) = \sum_{j \neq i} P(C_j \mid x) = 1 - P(C_i \mid x), \qquad (1)$$

where $P(C_j \mid x)$ is the probability that an example $x$ belongs to a specific class $C_j$.
Here, the Laplace method is utilized to estimate the probability that an example belongs to class $i$ in a specific node. If there are $N_i$ examples of class $i$ at a node and $T_c$ classes, then the probability $P_{ic}$ that an example at this node belongs to class $i$ is estimated by [36]

$$P_{ic} = \frac{N_i + 1}{N_T + T_c}, \qquad (2)$$

where $N_T$ is the total number of examples at that node.
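The Laplace estimate can be sketched in a few lines of Python. The function names and the node counts below (90 examples of one class, none of the other) are assumptions made for illustration; the second function is simply one minus the estimate, which is the zero-one expected loss of predicting that class.

```python
def laplace_probability(class_count, node_total, num_classes):
    """Laplace estimate (N_i + 1) / (N_T + T_c) of the class probability at a node."""
    return (class_count + 1) / (node_total + num_classes)


def zero_one_loss(class_count, node_total, num_classes):
    """One minus the Laplace estimate: the zero-one expected loss of predicting class i."""
    return 1.0 - laplace_probability(class_count, node_total, num_classes)


# Hypothetical node: 90 "healthy" examples and 0 "sick" examples, two classes.
counts = {"healthy": 90, "sick": 0}
total = sum(counts.values())
estimates = {label: laplace_probability(n, total, len(counts))
             for label, n in counts.items()}
```

Because of the added one in the numerator, a class with no examples at the node still receives a small nonzero probability (here 1/92), and the estimates over all classes still sum to one.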
Thus, when the Laplace method is employed in (1) to estimate the probability that an example belongs to a specific class in a node, the zero-one loss of predicting class $i$ at that node, $R(a_i \mid x)$, is computed as follows:

$$R(a_i \mid x) = 1 - P_{ic} = 1 - \frac{N_i + 1}{N_T + T_c}. \qquad (3)$$

The proposed pruning method prunes the decision tree in a bottom-up fashion by adopting the REP approach. However, instead of estimating the misclassification error rate, the expected loss is estimated by applying the zero-one loss function. For that reason, the proposed method is named zero-one loss function pruning. The zero-one loss for each node is computed by applying (3). Then, the expected loss, $L_{ex}$, for each class is computed by multiplying the zero-one loss by the loss function related to that node:

$$L_{ex} = R(a_i \mid x)\, L(a_i \mid C_j). \qquad (4)$$

The pruning process is performed by dividing the dataset into two sets: a training set and a pruning set. The training set is used to create the decision tree, which is then tested by using the pruning set. During the pruning, the expected loss for each tested node ($L^{p}_{ex}$) is computed as the sum of the expected losses of its classes. If the expected loss for the tested node is less than or equal to the sum of the expected losses for the whole subtree under that tested node ($L^{t}_{ex}$), the tested node is converted to a leaf node. The new leaf node is labeled with the class of least expected loss, which is equivalent to the majority class at that node. The tree is created based on the C4.5 decision tree algorithm and then the pruning process is performed by using the newly introduced ZOLFP method that is shown in Figure 1.
How the proposed method works is shown with a running example in Figure 2. In this example, the unpruned decision tree has two classes: a person is classified as healthy or sick. The proposed ZOLFP method is applied to prune this tree based on the loss (cost) approach. The ZOLFP method computes the expected loss for each node by applying Eq. (4). The loss function for each node is indicated, where h denotes the healthy class and s denotes the sick class. In Figure 2a, the subtree should not be pruned when the ZOLFP method is applied because the parent loss is greater than the sum of the losses of the children (10.4 versus 5.9). Figure 2b shows the reverse situation. Here, the ZOLFP method prunes the subtree because the parent has a lower loss than the sum of the losses of the children (9.2 versus 14.2).
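The comparison in the running example can be expressed as a one-line rule. The sketch below uses the loss totals quoted for Figure 2 (10.4 versus 5.9, and 9.2 versus 14.2); the function name is hypothetical, and since the individual child losses are not given in the text, each children total is passed as a single value.

```python
def should_prune(parent_loss, child_losses):
    """Prune when the parent's expected loss does not exceed the children's total."""
    return parent_loss <= sum(child_losses)


# Figure 2a: parent loss 10.4, children total 5.9, so the subtree is kept.
keep_case = should_prune(10.4, [5.9])

# Figure 2b: parent loss 9.2, children total 14.2, so the subtree is pruned.
prune_case = should_prune(9.2, [14.2])
```

Note that the rule uses "less than or equal to", so a tie between the parent and its subtree favors the smaller tree.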

Experimental result of the proposed ZOLFP method
In this section, the performance of the proposed ZOLFP method is demonstrated and compared to the UDT-C4.5, REP, and MEP approaches on different datasets.

Figure 1. The proposed ZOLFP algorithm.

Input: Dataset
Output: Post-pruned decision tree with Zero-One Loss Function

1. Divide the dataset into two sets: 70% for the training set and 30% for the pruning set
2. Use the training set to create a decision tree based on C4.5
3. Use the pruning set to test and prune the decision tree
4. For each node
5.     If node is a test (parent) node then
6.         Compute the expected loss for this test (parent) node ($L^{p}_{ex}$) and the expected loss of its subtree ($L^{t}_{ex}$)
7.         If $L^{p}_{ex} \le L^{t}_{ex}$ then
8.             Convert the test (parent) node to a leaf
9.         Else
10.            Retain the test (parent) node
11.        End if
12.    End if
13. End for
14. Return the final tree
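The bottom-up traversal in Figure 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Node` class is hypothetical, and `leaf_loss` is an assumed stand-in for the node loss (the node's example count times one minus the largest Laplace estimate), not the paper's exact expected-loss formula. The class counts are meant to be those of the pruning set reaching each node.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    class_counts: Dict[str, int]              # pruning-set examples per class at this node
    children: List["Node"] = field(default_factory=list)
    label: str = ""                           # set when the node becomes a leaf


def leaf_loss(counts: Dict[str, int]) -> float:
    """ASSUMED loss: N_T * (1 - largest Laplace estimate). Not the paper's Eq. (4)."""
    total = sum(counts.values())
    best = max(counts.values())
    return total * (1.0 - (best + 1) / (total + len(counts)))


def prune(node: Node, loss_fn: Callable[[Dict[str, int]], float] = leaf_loss) -> float:
    """Prune in place, children first; return the loss of the resulting subtree."""
    if not node.children:
        return loss_fn(node.class_counts)
    subtree_loss = sum(prune(child, loss_fn) for child in node.children)
    parent_loss = loss_fn(node.class_counts)
    if parent_loss <= subtree_loss:
        # Convert the test node to a leaf labeled with the majority class.
        node.children = []
        node.label = max(node.class_counts, key=node.class_counts.get)
        return parent_loss
    return subtree_loss


# Hypothetical trees: the first split separates the classes, the second does not.
kept = Node({"healthy": 50, "sick": 50},
            [Node({"healthy": 50, "sick": 0}), Node({"healthy": 0, "sick": 50})])
pruned = Node({"healthy": 90, "sick": 10},
              [Node({"healthy": 45, "sick": 5}), Node({"healthy": 45, "sick": 5})])
prune(kept)
prune(pruned)
```

Because children are processed before their parent, a subtree that has already collapsed to a leaf contributes its leaf loss when its own parent is evaluated. Any other node loss can be supplied through the `loss_fn` parameter.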

Experiment data
The experiment datasets are collected from UCI machine learning repository (https://archive.ics.uci.edu/ml).
In this paper, six different datasets have been used to evaluate the proposed method. Table 1 presents the number of instances, the number of classes, and the number of attributes for the datasets.

Experiment process
The experiment process is carried out by using Java in Eclipse combined with Weka. Additionally, the Weka attribute evaluator techniques One Rule (OneR) and Information Gain (InfoGainAttributeEval) are employed to select the attributes with high impact on the datasets and remove the worst attributes that affect the accuracy, as shown in Table 2.
The datasets in Table 1 are trained and tested with 10-fold cross validation by the proposed ZOLFP method after the worst attributes are removed by the OneR attribute evaluator and the InfoGainAttributeEval attribute evaluator. A comparison is conducted between the 10-fold attribute evaluators' results and the 60% hold-out validation results, as indicated in Table 3. The results in Table 3 show that the 60% hold-out validation method obtains better performance for the ZOLFP method than 10-fold cross-validation.
It is also evident that ZOLFP with the 10-fold OneR attribute evaluator achieves better accuracies than ZOLFP with the 10-fold InfoGainAttributeEval evaluator: the former scores higher on four datasets and the latter on none, whereas both methods have the same score on one dataset. Because the hold-out method suffers from some limitations and is not sufficient to give a reliable model, cross-validation methods are considered in this paper.
OneR attribute evaluator with 10-fold cross validation method is employed in this experiment since it shows better performance compared to InfoGainAttributeEval attribute evaluator with 10-fold cross validation method.
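The difference between the two validation schemes can be sketched in plain Python. This is not the Weka procedure used in the experiments; the function names are hypothetical, and reading "60% hold-out" as 60% of the examples for training is an assumption.

```python
import random


def holdout_split(n, train_fraction, seed=0):
    """Shuffle indices 0..n-1 once and cut them into a train part and a test part."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(n * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold_splits(n, k=10, seed=0):
    """Yield (train, test) index lists; every example is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


train, test = holdout_split(100, 0.6)   # assumed: 60% of examples used for training
splits = list(k_fold_splits(100, k=10))
```

With hold-out, the reported score depends on a single random cut, whereas 10-fold cross validation averages over ten test folds that together cover every example, which is one reason it is usually regarded as the more reliable estimate.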

Experiment result analysis
Typically, the complexity of the tree is measured by one of several metrics, such as the total number of nodes (tree size), the total number of leaves, the tree depth, and the number of attributes used [18,37]. Table 4 shows the accuracy and the tree complexity, in terms of tree size as the total number of nodes and leaves, for the ZOLFP, REP, and MEP approaches. The results are also compared with the UDT-C4.5 algorithm to illustrate the effect of pruning. For all the datasets, the proposed ZOLFP method produces better accuracy compared to the REP and MEP methods. In terms of complexity, the proposed ZOLFP method produces a smaller tree size than REP and MEP for three datasets. Where the ZOLFP complexity is higher, the accuracy of ZOLFP is still better than the accuracies of REP and MEP. On the other hand, the proposed ZOLFP method produces a smaller tree with better accuracy compared to UDT-C4.5 for all datasets.
Additionally, precision, recall, the weighted averages of true positive (TP) and false positive (FP) rates, F-measure, and area under the receiver operating characteristic (ROC) curve are also investigated to evaluate the performance of the ZOLFP, REP, and MEP methods, as shown in Table 5. For precision, ZOLFP produces better scores than REP and MEP in all datasets, whereas in terms of recall, ZOLFP produces better scores in five datasets. For the Diabetes dataset, the recall scores of ZOLFP and REP are the same, but the precision score of ZOLFP is better than that of REP. The proposed method produces the highest TP rate and F-measure in all datasets and the lowest FP rates for four datasets. The proposed ZOLFP method also produces the highest area under ROC scores compared to REP and MEP in four datasets.
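For reference, the per-class scores named above can be computed from confusion-matrix counts as in the following Python sketch. The counts are hypothetical, and the sketch covers a single class only; the weighted averages reported in Table 5 would additionally weight each class's score by its share of the examples.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall (TP rate), FP rate, and F-measure for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0      # identical to the TP rate
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"precision": precision, "recall": recall,
            "fp_rate": fp_rate, "f_measure": f_measure}


# Hypothetical counts: 40 true positives, 10 false positives,
# 10 false negatives, 40 true negatives.
scores = binary_metrics(tp=40, fp=10, fn=10, tn=40)
```

Recall and TP rate are the same quantity for a given class, which is why a method with the highest recall also has the highest TP rate for that class.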

Conclusions
In this paper, a decision tree pruning method known as ZOLFP that takes into account both classification accuracy and tree size is introduced. We proposed a zero-one loss function pruning method that minimizes the expected loss of the produced tree. This method is based on loss minimization: instead of estimating the misclassification error rate, the loss of misclassification is estimated. The proposed method is trained on six different datasets and a comparison is conducted between the proposed ZOLFP method and the UDT-C4.5, REP, and MEP methods in terms of accuracy and complexity.
The experimental results show that the ZOLFP method outperforms the REP and MEP methods in terms of accuracy for all datasets employed. In terms of complexity, the ZOLFP method produces a smaller tree than REP and MEP in three datasets. On the other hand, the ZOLFP method outperforms the UDT-C4.5 method in terms of accuracy and complexity for all datasets. Furthermore, the results also show that the proposed ZOLFP method yields reasonable scores for precision, recall, TP rate, FP rate, and area under ROC compared to the REP and MEP methods.
The proposed algorithm adopts a postpruning bottom-up method for C4.5 decision tree algorithm. As