Dynamically updated diversified ensemble-based approach for handling concept drift

Concept drift is the phenomenon where the underlying data distribution changes unexpectedly over time. Examining such drifts and gaining insight into the processes executing at that instant is a major challenge. Prediction models should be capable of handling drifts in scenarios where statistical properties change abruptly. Various strategies exist in the literature to deal with such challenging scenarios, but the majority are limited to identifying a particular kind of drift pattern. The proposed approach uses online drift detection in a diversified adaptive setting with pruning techniques to formulate a concept drift handling approach, named ensemble-based online diversified drift detection (En-ODDD), with the aim of identifying the majority of drifts, including abrupt, gradual, recurring, and mixed, in a single model. En-ODDD is equipped with a dynamically updated ensemble to speed up adaptation to changing distributions. Unlike prevalent approaches, which do not consider correlations between experts, En-ODDD trains its experts on varying randomized subsets of the input data. Different levels of sampling have been applied for diversity generation to promote generalization. Prediction accuracy has been used to evaluate the effectiveness of the proposed approach using the Massive Online Analysis software, compared against ten state-of-the-art algorithms. Experimental results on fifteen benchmark datasets (artificial and real-world) with up to one million instances show that En-ODDD outperforms the existing approaches irrespective of the nature of drift.


Introduction
In today's world, a wide range of application domains like adaptive system control, text mining, and information retrieval generate streaming data, and many of them register concept drift [1,2]. Adversarial conditions, changes in population, switching of user interests, and environmental complexity are various reasons for drift to occur. Furthermore, with the passage of time, data that were once used to train the classifiers become outdated, thereby reducing the prediction capability of the model. Handling nonstationary data is a challenging task in stream mining. Relevant works are mainly classified as online or block-based techniques.
In block-based techniques, data are processed in the form of fixed-size or variably sized batches [3,4], whereas online approaches analyze instances on the go [5,6]. Both can use either an explicit drift detection mechanism [7][8][9] or an implicit adaptive strategy to handle evolving data distributions. In adaptive approaches, high computational cost is incurred due to the constant updating of ensemble members even in the absence of drift. On the other hand, drift detector-based techniques are sometimes not effective as they lack perfectly updated component classifiers. They may face catastrophic forgetting, especially in the case of recurrent changes. Hence it is imperative to combine the characteristic features of both groups such that better adaptation to all kinds of drifts is achieved. The majority of existing streaming approaches do not focus upon the use of diverse ensemble components. Ensemble members trained on similar data become almost identical after long periods of stability and consequently do not react well to incoming streams. Thus, to enhance true detection capability, algorithms should consider the diverse effects of statistical changes corresponding to specific drift patterns [5].
Considering these motivations, a novel concept drift detection technique has been proposed. This paper puts forward a hybrid diversity-based approach in line with the characteristic features of both the explicit and adaptive techniques. The contributions of our paper are summarized as follows:

Ensemble-based online diversified drift detection (En-ODDD):
The paper introduces an explicit trigger-based drift detection mechanism in the dynamically updated block-based ensemble framework, which obtains high prediction performance in varied data stream settings. The ensemble experts are built using the most recent data chunk and are pruned in a timely manner to deal with the deteriorating performance of the overall ensemble and cope with drifting data.

Diversified incremental training:
En-ODDD augments the usual incremental training by deployment of an online bagging approach at the time of creation as well as updating ensemble experts, which introduces diversity and randomization to the input instances. We construct an ensemble of experts, which uses various subsamples of training data to achieve high accuracy and generalization. This ensures effective learning of the underlying models even during stable periods.

Extensive experimentation:
The performance of the proposed concept drift-handling approach has been evaluated to examine the effectiveness and reliability of En-ODDD on twelve artificial datasets and three popularly used real datasets in the concept drift domain, namely Poker, Covertype, and Weather. Along with the analysis of prediction accuracy, other performance measures such as model cost, training time, and testing time have also been considered and validated using statistical tests.

Comprehensive empirical study:
Results have been obtained by comparing ten existing online and block-based algorithms while inducing a majority of drift patterns, including gradual, abrupt, and recurring, through the Massive Online Analysis (MOA) framework for data streaming. Evaluation on complex combinations of drifting streams, which are otherwise difficult to handle, is a major highlight.
The rest of the paper is organized as follows: in Section 2, overview of the related work is given. The proposed approach, En-ODDD, is discussed in Section 3, while the methodology adopted is stated in Section 4. The results along with discussion are provided in Section 5, along with statistical analysis. Section 6 highlights the threats to the validity of the paper. Section 7 provides the conclusion of the work, giving points for future extensions.

Basic notations
A concept is the quantity that a particular learning model (M) is trying to predict. Concept drift refers to the scenario in which the statistical properties of the target concept, or the underlying conceptual data distribution, change over time. In classification, a model is built with the objective of predicting the target class label y_i (i = 1, 2, ..., m) of the incoming data instance x. At every time step t, the model analyzes the labeled training instances X = (x_1, x_2, x_3, ..., x_t), while an incoming instance x_{t+1} is treated as the testing instance. Prediction is based on the estimated distribution D_t, represented by the joint probability P_t(X, y_i) at time step t.
Concept drift is registered whenever there occurs any change in the joint probability between time steps t_0 and t_1 [10], i.e. whenever P_{t_0}(X, y) ≠ P_{t_1}(X, y).
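This definition can be illustrated with a minimal sketch. The stream generator, drift point, and fixed model below are toy constructions of our own, not anything from the paper; the only point is that a change in P(y | x) at t_0 immediately degrades a model fitted to the old concept:

```python
import random

random.seed(0)

def sample(t, drift_at=500):
    """Draw (x, y) from a stream whose concept P(y | x) flips at `drift_at`."""
    x = random.random()
    if t < drift_at:
        y = int(x > 0.5)       # old concept
    else:
        y = int(x <= 0.5)      # new concept: decision boundary inverted
    return x, y

def old_model(x):
    """A fixed classifier that matches the old concept exactly."""
    return int(x > 0.5)

err_before = sum(old_model(x) != y for x, y in (sample(t) for t in range(500)))
err_after = sum(old_model(x) != y for x, y in (sample(t) for t in range(500, 1000)))
print(err_before, err_after)   # → 0 500: every post-drift instance is misclassified
```

A static model therefore needs either an explicit detector or a continuous update mechanism to recover once P(X, y) changes.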

Streaming approaches to handle and/or detect concept drift
Related works can be grouped into four types: ensemble-based update works, explicit detector-based works, windowing-based works, and diversity-based works.
Many ensemble-based algorithms used to handle concept drift have been discussed in the literature [4,6,11]. Compared to single-classifier techniques, they provide better adaptability to drifting streams as they capture the dynamic concept under nonstationary conditions. Dynamic weighted majority (DWM) [12], accuracy updated ensemble (AUE2) [3], and the weighted majority algorithm (WMA) [13] are popular ensemble-based techniques. They have weighted learners, which are built and removed from the ensemble in response to drifts in prediction accuracy. However, their configuration depends only on the current batch of examples and lacks adaptability to sudden drifts. Generally, datasets contain different types of drifts; however, most existing ensemble-based techniques handle only a particular type of drift pattern [14,15]. This paper incorporates the specific mechanism of a detector to handle abrupt drifts and constant weight-based updates to handle gradual and recurring drifts. As the training of the base learners of the ensemble is time-consuming, En-ODDD uses multithreading to execute these operations in parallel without any loss of prediction performance. Unlike existing approaches, which rely on learners trained from the current chunk, En-ODDD leverages relevant information from past ensemble members.
In the category of drift detectors [16], models use trigger mechanisms and statistical tests to identify the drifts in distribution. In DDM [9] and EDDM [7], concept drift is signaled when misclassification error rates exceed fixed threshold values. DDM works best for abrupt changes and EDDM handles gradual drifts. However, the explicit detector-based approaches are usually trained to perform well for a specific type of drift pattern and may not adapt otherwise. Moreover, they are quite sensitive to noisy streams as they lack continuous updating. Our approach uses a detector-based ensemble setting, which is updated at regular intervals of batches and detects all possible kinds of drifts. Moreover, our approach performs well even on noisy data streams.
In windowing-based detection techniques, models detect the concept drifts by using forgetting mechanisms [8]. A sliding window, which considers current instances as the training dataset, is the commonly used approach.
Very fast decision tree (VFDT) [17] is an induction algorithm that modifies the decision tree without storing instances once they have been used to train the model. Furthermore, CVFDT [18] was proposed, which had fixed-sized windows to locate aged nodes. However, these approaches depend largely on the selection of optimum window size to give good accuracy.
The work in [19,20] highlighted diversity's impact on the prediction capability of ensemble models.
Diversity for dealing with drifts (DDD) revealed that ensembles with multiple diversity levels perform differently for various kinds of drifts [5]. However, these approaches fail to provide faster recovery from concept drifts in longer durations of time. Our approach implements drift detection along with randomization of subsamples, which leads to adaptability to drifts in the long term, as well.
The algorithm En-ODDD presented in this paper incorporates online drift detection in a diversified adaptive setting while pruning the nonperforming experts at fixed intervals. Furthermore, different levels of randomization have been applied to produce the best prediction results in the above setting. In the following section, details of our proposed algorithm are discussed along with its characteristic features.

Proposed work
This section discusses the proposed approach, En-ODDD, which uses a drift detector in the online setting.
As seen in Figure 1, En-ODDD uses an ensemble-based model where a Hoeffding tree is used as the base expert for building the ensemble. Initially, each incoming instance is added to Ĉ_n, the current chunk, until the maximum chunk limit |Ĉ_max| is attained (Algorithm 1: lines 1-3). Online bagging, which manipulates the input instance, induces diversity in the base experts. The base expert X̂_e is updated k̂ times (with k̂ drawn from the Poisson(λ) distribution) with the current instance (lines 5-8). As stated by Oza in [21], when the number of instances used for training tends to infinity, the binomial distribution of the k̂ value tends to the Poisson(λ) distribution for λ = 1. Each expert has a separate drift detector, which constantly monitors the drift error rate produced during classification of the current instance (Algorithm 2). The detector uses DriftErrWin, a sliding window of variable length, storing the predictions of the current expert.
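The Poisson-based online bagging update described above can be sketched as follows. The `CountingExpert` class and `poisson` helper are illustrative stand-ins (the paper uses Hoeffding trees and MOA's implementation), but the update rule — presenting each instance to each expert k̂ ~ Poisson(λ) times — is the one described:

```python
import math
import random

random.seed(1)

def poisson(lam):
    """Sample k ~ Poisson(lam) using Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def online_bagging_update(experts, x, y, lam=1.0):
    """Present the instance to each expert k times, k ~ Poisson(lam), so every
    expert effectively sees a different bootstrap replicate of the stream."""
    for expert in experts:
        for _ in range(poisson(lam)):
            expert.learn(x, y)

class CountingExpert:
    """Dummy expert that only records how many updates it received."""
    def __init__(self):
        self.updates = 0
    def learn(self, x, y):
        self.updates += 1

ensemble = [CountingExpert() for _ in range(10)]
for t in range(1000):
    online_bagging_update(ensemble, x=t, y=t % 2)
print([e.updates for e in ensemble])   # counts scatter around 1000, not identical
```

Because the per-expert update counts diverge, the experts drift apart in what they have learned, which is exactly the diversity the ensemble relies on during stable periods.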
Whenever the difference between the average values of two subwindows exceeds the drift threshold, a drift is signaled, the expert with the maximum drift error is reset, and its drift error window is reinitialized (Algorithm 1: lines 12-16). On attaining the predefined chunk limit (500 in the proposed scheme), the weights of the existing experts are updated with the current chunk using Eq. (5), where the probabilities of all input classes are considered [3]. p_y^k(x) denotes the probability that an expert X̂_k classifies x as an instance of class y. The weights of the experts (w_nonLinear) depend upon the mean square error of their misclassifications and that of a randomly predicting expert, both calculated on the current chunk Ĉ_n. In En-ODDD, a new expert X̂_en is added, and one of the existing experts, with minimum weight, is pruned to maintain a consistent ensemble size (lines 21-25). The weight assigned to X̂_en, given by Eq. (6), depends only on the current data distribution. A small value ϵ is added to Eq. (6) to avoid division by zero. After the replacement of an expert, as discussed above, all the existing experts are updated using online bagging and Ĉ_n is reinitialized (lines 26-30). In due course of time, after a stable duration, most of the ensemble experts become essentially identical as they are trained on similar data instances. Here, diversity has been introduced by employing bagging to provide higher generalization accuracy among the experts.
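A simplified two-subwindow error detector of this kind might look as follows. The window length, warm-up length, and threshold below are illustrative choices of our own, not the values used by En-ODDD:

```python
from collections import deque

class SubwindowDriftDetector:
    """Sliding window of 0/1 prediction errors; signals drift when the mean
    error of the recent half exceeds that of the older half by `threshold`."""
    def __init__(self, max_len=200, threshold=0.15):
        self.win = deque(maxlen=max_len)
        self.threshold = threshold

    def add(self, error):
        """error: 1 if the expert misclassified the instance, else 0."""
        self.win.append(error)
        n = len(self.win)
        if n < 40:                      # wait for enough evidence
            return False
        half = n // 2
        items = list(self.win)
        old = sum(items[:half]) / half
        new = sum(items[half:]) / (n - half)
        return (new - old) > self.threshold

det = SubwindowDriftDetector()
signals = []
for t in range(400):
    err = 0 if t < 200 else 1           # the expert suddenly starts failing
    signals.append(det.add(err))
print(signals.index(True))              # drift is flagged shortly after t = 200
```

In En-ODDD one such detector is kept per expert, so the expert whose recent error has degraded most can be located and reset without disturbing the rest of the ensemble.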

Algorithm 1 Ensemble-based online diversified drift detection (En-ODDD).
Input:
⟨x_i^t, y_i^t⟩: ith training instance at time t with feature vector ⟨x_i⟩ and class label y_i from dataset D
Ĉ_n: current chunk of instances
Ê: ensemble of experts X̂_e
DriftErrWin: drift error window corresponding to the ensemble experts
DetectDrift: method to detect drift
|Ĉ_max|: max size of chunk
|Ê|: current size of ensemble
L: maximum ensemble size, where L ∈ N

1: for each incoming instance b_i do
2:   add b_i to the current chunk Ĉ_n
3:   ...
4:   for all experts X̂_e in ensemble Ê do
5:     k̂ = Poisson(λ)
6:     if k̂ > 0 then
7:       weightedInstance ω_b = weight(b_i) × k̂
8:       update the expert X̂_e with ω_b
9:     end if
10:  end for
11:  driftSignal = DetectDrift(DriftErrWin)
12:  if driftSignal == true then
13:    î_d = LocateMaxDriftErrorExpert(Ê)
14:    reset the learning expert X̂_e[î_d]
15:    reset the DriftErrWin for the expert X̂_e[î_d]
16:  end if
17:  if b_i mod |Ĉ_max| == 0 then
18:    for each X̂_e in Ê do
19:      ŵ_e = w_nonLinear using Eq. (5)
20:    end for
21:    construct the new expert X̂_en on Ĉ_n and calculate its weight using Eq. (6)
22:    if |Ê| ≥ L then
23:      remove the worst expert with min ŵ_e from Ê
24:    end if
25:    add X̂_en to Ê
26:    for each X̂_e in Ê do
27:      TrainWithBagging(X̂_e, Ĉ_n)
28:    end for
29:    reinitialize Ĉ_n
30:  end if
31: end for
32: Output final hypothesis Ĥ

Methodology
In this section the datasets used and the methodology adopted to perform a comparison of different concept drift-handling techniques are discussed.

Datasets
All the datasets, artificial and real, used to analyze the proposed approach are described briefly in Table 2. Twelve artificial datasets have different variations of drift simulated in them. Three real datasets, Covertype, Poker, and Weather, commonly used in the concept drift domain, have been considered for experimentation and evaluation.
Poker is generated by varying the combination of suits and ranks of five playing cards drawn from a standard deck of 52 cards [14]. It has ten predictive attributes (5 cards × 2 attributes: rank and suit) along with one more attribute known as the poker hand. This value is inferred after identification of the value of the five cards in the game. A total of 25,000 instances are drawn from this dataset.
The Covertype dataset is based on cover type information of forests obtained from the US Forest Service's regional resource information system data. Fifty-three cartographic variables define the examples of this dataset. Instances may belong to one of the seven cover types based on the cartographic variables, and 581,012 instances and 54 attributes represent this dataset [4].
Weather is based on records compiled by the US National Oceanic and Atmospheric Administration over 50 years from 9000 weather stations worldwide [22]. It is a meaningful real-world dataset with a diverse and extensive range of weather patterns along with meteorological data like temperature and wind speed, making it suitable for long-term prediction and drift problems.

Experimental setup
This section presents the empirical study conducted to compare the results of existing classifiers with the proposed approach. Experiments were performed using the MOA framework [23], a tool extensively used in the data stream domain to analyze streaming approaches. A machine equipped with an Intel Core i7-6700 CPU @ 3.41 GHz and 8.00 GB of RAM was used. An initial study indicated that using a large number of classifiers does not increase accuracy; instead, it increases the time requirements. Taking this into consideration, the number of learners in the ensemble-based approaches is 10, with a Hoeffding tree as the base learner (δ = 0.01, n_min = 100, t_i = 0.05). A chunk size of |d| = 500 is used for all datasets, since this value is considered the minimal suitable size for block-based ensembles. The ensemble experts are trained in parallel using separate threads, which reduces training time considerably. For a meaningful comparison between the different algorithms, the same parameter values as stated above have been used.

Evaluation using different diversity levels
To leverage the bagging performance, En-ODDD was tested by introducing different levels of diversity. As λ is the parameter that largely influences the diversity, its impact on predictive accuracy was verified by tuning it on training data. Table 3 presents the average accuracies obtained by performing 8 preliminary executions using λ = 0.01, 0.1, 0.5, 1, 1.5, 2, 2.5, and 3 on each dataset. Additionally, the plot in Figure 2 depicts that prediction accuracy in most of the considered datasets increases until λ = 1.5. However, the performance of En-ODDD tends to converge for λ > 1.5. Thus, for analyzing the performance of the proposed approach, a value of λ = 1.5 has been considered in all experiments.

Evaluation using interleaved test-then-train method
In this methodology, every instance is used first for evaluating the existing classifier before using it for the update process. However, the classifier is always tested on unseen instances [16]. Plots between the number of processed instances and classification accuracy are drawn to examine the effect of the underlying classifiers on concept drift. Accuracy has been evaluated as the percentage of instances classified correctly over the total number of instances. Tables 4 and 5 present the results of the average accuracy and training time of the algorithms evaluated.
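The interleaved test-then-train protocol can be sketched in a few lines. The `MajorityClass` model and the synthetic stream below are toy stand-ins of our own, not the classifiers or data used in the paper:

```python
def prequential_accuracy(stream, model):
    """Interleaved test-then-train: each instance is first used to test the
    current model, then to update it, so the model is always evaluated on
    unseen data. `model` must provide predict(x) and learn(x, y)."""
    correct = total = 0
    history = []
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        total += 1
        model.learn(x, y)                 # update only after testing
        history.append(correct / total)   # running accuracy curve
    return history

class MajorityClass:
    """Minimal incremental model: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

stream = [(i, 1 if i % 4 else 0) for i in range(1000)]   # label 1 three times in four
curve = prequential_accuracy(stream, MajorityClass())
print(round(curve[-1], 3))   # → 0.748
```

Plotting `history` against the number of processed instances gives exactly the kind of accuracy-over-time curve used in the figures of this paper.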

Parametric configuration
En-ODDD has been compared with both online and block-based algorithms by analyzing various performance metrics as given in Table 6. To implement the ensemble-based approaches, the default values of the parameters were used as per their configuration. In DWM, the factor to penalize experts (β) is set to 0.5, the minimum fraction weight (θ) is set to 0.01, and the period between removal of experts (p) is set to 50. In AWE and AUE2, the number of classifiers to learn is set to 10 with a block size of 500. Block sizes of 250, 750, and 1000 were also tested, but 500 provided the best average accuracy. The DDD algorithm is chosen for the experiments with λ_l = 1 and λ_h = 0.1, as it is an important approach considering the diversity of ensembles. LearnNSE, online bagging, and leverage bagging are chosen as efficient representatives of the ensemble-based online approaches.

Results and discussion
The performance of the different algorithms under varying drift patterns on the datasets considered for evaluation is presented below. Due to space limitations, we show only the most interesting plots.
Experiments with wave generator: Figure 3 shows the accuracy achieved on the Wave_abr dataset. The best performing algorithms here are En-ODDD, DDD, and Oza, closely followed by AUE2 and LevBag. ARF, WMA, and LearnNSE show a relative loss in performance. Here the first drift has a major influence on the accuracy, which seems to stabilize later. The use of an explicit drift detector in En-ODDD likely works best with abrupt changes, which is why the performance of the remaining algorithms may not match that of this hybrid technique. ACE has shown poor results despite the presence of a drift detector, which may be attributed to the fact that it does not prune poorly performing classifiers that are not updated from time to time. As seen in Figure 4, En-ODDD and Oza show the most accurate results for the mixed dataset Wave_mix, comprising two gradually moving concepts separated by an abrupt drift at 500k instances. Since AUE2 is not equipped with any explicit drift detection, it performs more poorly compared to diversity-based approaches like DDD, ARF, and En-ODDD. However, DWM and LearnNSE handle this change without a large effect on performance due to their adaptive nature. ACE shows steady performance with mixed drifts, without much rise or fall near drift points.

Table 6. Compared algorithms and their parameters:
LearnNSE [22]: ensemble, block-based, implicit; a: sigmoid slope, b: sigmoid infliction point
Oza [21]: ensemble, online, implicit; l: base learner option
DWM [12]: ensemble, online, implicit; β: factor for decreasing weights of experts (0 ≤ β < 1), p: period between expert removal, θ: threshold for deleting experts
ACE [24]: ensemble, online, explicit; α: confidence level factor, µ: adjustment factor of ensemble, S_a: short-term memory size
WMA [13]: ensemble, online, implicit; l: learner option list, β: penalty factor for experts, γ: minimum fraction of weight per model, p: pruning factor
AUE2 [3]: ensemble, block-based, implicit; n: maximum number of component classifiers in ensemble, c: chunk size, m: maximum byte size of ensemble memory
LevBag [25]: ensemble, online, implicit; λ_l: to maintain diversity of ensemble, l: base learner option
ARF [6]: ensemble; …

Experiments with Agrawal generator: En-ODDD performs well at all three consecutive drift points, i.e. at 250k, 500k, and 750k instances, with the accuracy of LevBag slightly lower. The capability of both to sustain accuracy is due to the presence of diverse ensemble components. The explicit drift detector in En-ODDD and ACE helps them recover from the drop in accuracy after the first drift. ACE shows steady accuracy near the drift point. DDD handles itself efficiently compared to the others, but most of the algorithms, like LearnNSE, DWM, and even AUE2, are severely affected by the first drift point. Even ARF, which usually stabilizes itself after drift points, shows a large fall in accuracy after the first drift.
Experiments with RBF generator: In this dataset there is an interesting use case of four alternating drift points at 125k, 250k, 500k, and 750k. En-ODDD, DDD, LevBag, and Oza performed similarly in recovering from these drifts. WMA and DWM failed to adapt to mixed drifts due to the absence of explicit drift detectors. Despite the large difference in the accuracy levels of ARF and LearnNSE, both showed a similar recovery pattern after the drifts, possibly because of strong temporal similarity in the data used for classification. In the case of RBF_grdl_rec there is a slight difference in the accuracy of the diversity-based online algorithms DDD, LevBag, Oza, and En-ODDD. In this scenario, there is no single best performing approach. ARF showed a severe drop in performance after 500k instances, which clearly shows it is not suitable in gradually recurring drift scenarios.
Experiments with tree generator: Figures 5 and 6 illustrate the performance of the classifiers on the Tree_grdl_rec and Tree_abr_rec datasets, respectively. In the Tree_grdl_rec scenario, the speed of recurring changes plays an important role. Although DWM shows better adaptation to drift, En-ODDD performs quite well, achieving better accuracy than DDD for fast drifts. Interestingly, LevBag outperformed the others before the first drift point but showed a major drop in accuracy after that. AUE2 performs similarly to En-ODDD due to the removal of buffer classifiers. Both Oza and LevBag fail to adapt to the recurring drifts after every 250k instances as they lack any pruning mechanism. The results indicate that a diversified ensemble without any drift control strategy is not enough to handle such situations. LearnNSE, WMA, and ACE do not react well to recurring changes irrespective of the speed of change. In the Tree_abr_rec dataset, abruptly recurring drifts are simulated after every 200k instances. En-ODDD, closely followed by the LevBag algorithm, performs efficiently in this case. DDD and Oza fail to respond well to the recurring drifts. The absence of a drift detector in Oza is one reason for its poor adaptation to suddenly recurring concepts. WMA shows a drastic decrease in accuracy. Furthermore, algorithms like LearnNSE and ACE, which do not prune their poorly performing components, show decreases in accuracy.

Experiments with real datasets: With all three real datasets, Poker, Weather, and Covertype, En-ODDD performs consistently better than all the algorithms considered for comparison. This efficient performance is accomplished because of the generalization in classification error produced by diverse components. From Figure 7, it is evident that adaptive approaches like AUE2, LevBag, ARF, LearnNSE, Oza, and DWM perform relatively better in this case as compared to simulated drifts.
The combination of online and adaptive approaches has helped En-ODDD achieve the best results. A significant drop in accuracy is observed in the DDD algorithm. The combination of low and high diversity ensembles does not cater to drifts in this real dataset. As with most of the artificial datasets, ACE continues to be a poor performer. Figure 8 analyzes the Poker dataset. There is a sudden increase in the accuracy of the En-ODDD and LevBag algorithms after 10k instances. ARF and DDD are also closely following them. However, other approaches like WMA and AUE2 show a consistent performance with no increase in accuracy at any point of time. DWM, LearnNSE, and ACE are the worst performing algorithms on this dataset.

Interpretability of the proposed approach
In the proposed approach, bagging is employed, which provides higher generalization accuracy to the ensemble system. It creates varied training sets of random subsamples for continuous improvement and outputs weighted aggregated results. However, as the process involves randomization, the experiments are sometimes less interpretable from a single execution. Thus, to verify correctness and reliability, ten repetitions of En-ODDD were performed for each dataset. Table 7 presents the average accuracy along with the mean ± standard deviation obtained over multiple runs. Furthermore, statistical tests were performed to analyze the varied performance over multiple datasets. It can be concluded that the deviation among multiple runs is not significant, making the approach reliable and trustworthy.

Trade-off analysis
Introduction of bagging in our approach increases the predictive accuracy of the underlying ensemble, as better training is achieved through diverse learners. Figure 9 demonstrates that En-ODDD achieves the best accuracy among all the compared algorithms on all the datasets. Figure 10 compares the training time of all approaches. It can be seen that for ARF, ACE, Oza, DDD, LearnNSE, and LevBag, training time is much higher than that of the proposed approach. Three algorithms, WMA, DWM, and AUE2, take less time than En-ODDD, but the accuracy achieved by our approach exceeds all of them. Though bagging incurs some overhead for random subsampling, the gain in accuracy outweighs it. Moreover, this time is reduced by the use of multithreading, where base learners are trained in parallel while updating. WMA and DWM took less training time than En-ODDD because they do not continuously update their existing ensembles; rather, they just use a pruning mechanism. Hence, they provide very low accuracy on all drifting streams (Table 8). AUE2 lacks a diversity generation strategy and therefore takes less time than our approach, but at the cost of accuracy. ACE in an online setting has an inbuilt detector, which accounts for the major part of its training time. ARF involves hyperparameter tuning, which is a time-consuming process. Thus, En-ODDD manages to maintain a balance between high accuracy and feasible training time. Figures 11, 12, 13, and 14 depict the evaluation time taken by the various algorithms for gradual, abrupt, mixed, and real drifts, respectively. It is observed that, with an increase in processed instances, the runtime of En-ODDD grows linearly and remains comparatively lower than that of other approaches. Constant updating and a weight-based detection system help En-ODDD handle concept drift of all types, which is otherwise difficult.
Since online concept drift problems focus primarily on achieving higher accuracy, it can be concluded that En-ODDD is trustworthy.

Statistical analysis
To compare the various algorithms and show whether significant differences exist among them, statistical test support is essential. This paper uses the Friedman and Wilcoxon tests, as is common for machine learning methods [4,6,14]. The null hypothesis for the experimental design states that there is no significant difference between the prediction performances of the tested algorithms. Post hoc analysis using the Bonferroni-Dunn test [26] is performed in case the null hypothesis is rejected. The F_f statistic ranks the methods based upon their average results [27]. The lowest rank is given to the best performing approach and vice versa. As stated in Eq. (7), the average ranks (R) and the Friedman statistic (χ_F²) are computed using N datasets and k algorithms. For AUE2, LevBag, and DDD, the Wilcoxon test was performed since their average accuracy ranks were higher than that of En-ODDD. The P-values obtained were P_DDD = P_AUE2 = 0.0006 and P_LevBag = 0.0011. These values indicate that En-ODDD is better in terms of accuracy than all the other algorithms considered.
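The Friedman statistic can be computed directly from the average ranks. The sketch below uses hypothetical ranks for four algorithms over ten datasets, not the paper's measured values:

```python
def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square from the average ranks of k algorithms over N datasets:
    chi_F^2 = 12N / (k(k+1)) * (sum_j R_j^2 - k(k+1)^2 / 4)."""
    k = len(avg_ranks)
    return (12 * n_datasets / (k * (k + 1))) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    )

# Hypothetical average accuracy ranks (rank 1 = best) of 4 algorithms over 10 datasets.
ranks = [1.2, 2.4, 2.9, 3.5]
print(round(friedman_statistic(ranks, 10), 2))   # → 17.16
```

If the statistic exceeds the critical chi-square value for k − 1 degrees of freedom, the null hypothesis of equal performance is rejected and post hoc pairwise tests, such as Wilcoxon, become meaningful.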
For training time, the F_f statistic returned 32.87, indicating rejection of the null hypothesis. By comparing the average ranks in Table 8 and performing the Wilcoxon test, it can be concluded that En-ODDD is faster than DDD, Oza, LevBag, and LearnNSE (P_DDD = 0.001, P_Oza = 0.008, P_LevBag = 0.0005, P_LearnNSE = 0.001) but slower than AUE2, DWM, and WMA. Overall, En-ODDD outperforms the compared algorithms in classification accuracy across all possible drift scenarios.

Threats to validity
This section discusses various potential threats to the validity of this study along with some mitigations taken to reduce their impact on our work. Though the performance of the proposed approach has been proved on a reasonable number of instances (∼1M), it can be further validated by increasing the number of instances.
Therefore, the analysis of the scalability of this approach is left for future work. Another threat could be the misinterpretation of the actual relationship between the predicted variable and the predictors, caused by not statistically evaluating the results [28]. However, this threat is mitigated in this study by using two nonparametric tests, the Wilcoxon and Friedman tests, at a 95% confidence level to statistically validate the results obtained. Finally, although the proposed technique exhibited encouraging prediction performance on various drift patterns like gradual, abrupt, and recurring, combinations of drift scenarios where multiple types of drift coexist require further investigation.

Conclusion and future scope
Concept drift handling is a challenging task where complexity increases, especially when dealing with heterogeneous drifts. This paper proposes and evaluates an incremental learning algorithm, En-ODDD, which ensures a timely reaction to concept drift by using an ensemble of diversified experts embedded with an active drift detector. A combination of a majority weighting mechanism and online drift detector in En-ODDD covers all possible drift scenarios: gradual, abrupt, recurring, and mixed. The introduction of diversity by modifying the incoming training instances using online bagging and a diversified update mechanism is the primary reason for En-ODDD outperforming most algorithms in terms of accuracy. An explicit drift detector enables En-ODDD to identify abrupt drifts quickly, without waiting for the chunk cycle to complete. Moreover, the updating of experts at regular intervals of time helps the model to adapt better to gradual drifts. The impact of diversity parameter λ is also investigated by testing En-ODDD with different values. Best results were obtained when λ = 1.5. After analyzing the results of various pruning strategies it can be concluded that substituting the worst performing expert is the best option. En-ODDD is compared with 10 state-of-the-art ensemble-based algorithms. An empirical study performed using 12 artificial and 3 real datasets proves that En-ODDD provides stable accuracy performance under all drifts, which is better than the compared algorithms. Also, the statistical tests suggest that En-ODDD achieves higher accuracy under different drifting streaming conditions.
In the future, we plan to extend our work by exploring techniques other than bagging to introduce diversity. We are also interested in developing strategies that can handle semisupervised streams of data, where labels are not available beforehand for all instances. Integrating the proposed approach with big data frameworks like Spark or Hadoop to improve its scalability could be another line of research.