Big Data: Controlling Fraud by Using Machine Learning Libraries on Spark

Continuous changes and the high computational volume of network data have made it more difficult to analyze the data and detect abnormal behavior within it. For this reason, big data solutions have gained importance. With the advancement of internet technologies and the digital age, cyber-attacks have increased steadily. The k-Means clustering algorithm is one of the most widely used algorithms in data mining. Clustering algorithms automatically divide data into smaller clusters or sub-clusters, placing statistically similar records in the same group. In this article, we used the k-Means method from the Machine Learning libraries on Spark to determine whether incoming network values represent normal behavior. We used 400 thousand network records obtained from the KDD Cup 1999 Data. With the k-means method, we detected 10 abnormal behaviors among these 400 thousand records.


Introduction
Continuous changes and the high computational volume of network data have made it more difficult to analyze the data and detect abnormal behavior within it. For this reason, big data solutions have gained importance [1]. In 2016, the budget request of the US Department of State Security, Network Security Department, was $479.8 million [1,2]. Norton has reported that victims of cybercrime have spent $126 billion globally since 2015 [1,3]. The increasingly intelligent, complex, and destructive nature of cybercrime has driven these large cybersecurity investments. It is therefore necessary to identify abnormalities in network data computationally, both to reduce these costs and to improve national security.
Spark is an open-source platform developed at UC Berkeley AMPLab in 2010. Its goal is to perform iterative computation efficiently on large datasets [4]. Abnormal behaviors can occur in many industries (banking, insurance, network security, etc.). Examples include credit card fraud, insurance policy fraud, abnormal packet exchanges on a network, and potential attacks. Identifying such cases is called fraud detection or anomaly detection. These cases can cause problems in every sector, such as financial and reputational losses. Cyber-attacks have increased steadily with the advancement of Internet technologies. These attacks use networks to gain unauthorized access to sensitive information. It is very important to intervene before an attack takes place. Big data refers to a collection of data that includes diverse types of datasets [1,5]. Network-based intrusion detection is required to detect unusual behavior by network users, and making this determination requires big data analysis [1].
A threat or attack is an attempt by unauthorized people to access or change information in the system. Such attempts produce anomalous behavior [1,6].
Fraud detection, attack detection, and prevention of data leakage are anomaly detection approaches [1,7]. The literature on anomaly detection in large networks includes extensive network data analyses, for example on the CTU-13 data [1]. There are many anomaly detection studies in the literature. These include network-wide anomaly detection with PCA [8], Kalman filters [9], single-link traffic measurements [10]-[11], the Hough transform [12], sketches [13], and equilibrium properties [14] [15]. The study in [16] provides a comprehensive summary of general anomaly detection techniques [15]. Another study [17] is a survey devoted specifically to anomaly detection in Internet traffic [15]. Lakhina et al. [18] used the PCA-based method [19], but applied it to several source-based entropy metrics. They recommended reusing these entropy measurements to classify anomalies [15]. Xu et al. [20] applied clustering to entropy metrics similar to [21] to classify abnormal events and construct a traffic model [15]. Fernandes et al. [22] proposed NADA, a signature-based tool that separates anomalies into different categories [15]. Silveira et al. [23] proposed URCA, a method for determining the underlying causes of abnormal events [15]. Very few studies have applied k-means on Spark. R. Kumari et al. discussed how intrusions can be detected with a k-means clustering-based machine learning approach using big data analytics techniques, and reported experimental results supporting the approach [24]. In one study, abnormal behaviors were detected using the Principal Component Analysis (PCA) method with an accuracy of 96%; this approach was implemented on public NetFlow data [1].
The theory and method of the study are given in Section 2, the experimental results are described in Section 3, and the conclusions are presented in Section 4. The purpose of this article is to identify which records differ from the rest among millions of network movements. The novelty of this article is that anomaly detection is performed on Spark using k-means on the KDD Cup 1999 network data. K-means is used on many platforms, and Spark is a current topic in the big data field; however, k-means on Spark has not so far been applied to anomaly detection on the KDD Cup 1999 network data.

Theory and Method
After they are collected, the network movements (datasets) are subjected to a normalization process. In this process, abnormal data is removed so that it is not used in modeling. The model is created after normalization. In the next stage, anomaly detection is performed by evaluating each record against the model. The algorithm we use here is k-Means, which is available in Spark's Machine Learning library. Within the clusters, similar records are gathered together, while some points remain far from the cluster centers. These points are defined as abnormal movements.
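The idea that points far from every cluster center are treated as abnormal can be sketched in plain Java as follows. This is a minimal illustration and not the code used in the study: the class name DistanceAnomalySketch, its method names, and the sample points are hypothetical, and the cluster centers are assumed to have been computed already.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: ranks points by distance to their nearest cluster center. */
class DistanceAnomalySketch {

    /** Euclidean distance between two vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Distance from a point to the closest of the given centers. */
    static double distanceToNearestCenter(double[] point, double[][] centers) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] center : centers) {
            best = Math.min(best, distance(point, center));
        }
        return best;
    }

    /** Indices of the n points that lie farthest from their nearest center. */
    static List<Integer> farthestPoints(double[][] points, double[][] centers, int n) {
        double[] score = new double[points.length];
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            score[i] = distanceToNearestCenter(points[i], centers);
            indices.add(i);
        }
        // Largest distance first; the sort is stable, so ties keep index order.
        indices.sort(Comparator.comparingDouble((Integer i) -> score[i]).reversed());
        return new ArrayList<>(indices.subList(0, Math.min(n, indices.size())));
    }

    public static void main(String[] args) {
        // Assumed toy data: two centers and five two-dimensional points.
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        double[][] points = {{0.5, 0.0}, {10.0, 9.0}, {5.0, 5.0}, {0.0, 1.0}, {30.0, 30.0}};
        System.out.println(farthestPoints(points, centers, 2));
    }
}
```

Taking a fixed number of farthest points is only one possible rule; a distance threshold would serve the same purpose when the expected number of anomalies is not known in advance.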

What is k-Means?
The k-means method, a multivariate statistical technique, is used to classify data into homogeneous subgroups according to their similarities. It is one of the best-known clustering methods [25]. The method starts by determining the centers of a predetermined number of clusters, and each observation is assigned to the nearest cluster center according to similarity [26]. After every observation in the input data set has been assigned to a cluster, the center of each cluster is recalculated, so that observations may be reassigned to different clusters depending on the location of the new centers. This process is repeated until cluster membership no longer changes. In a given problem, a data set $T$ with $K$ feature vectors and $n$ variables can be defined as $T = \{ x_k \mid k = 1, 2, \ldots, K \}$. In this data set, the $k$-th feature vector can be written as $x_k = [x_{k1}, x_{k2}, \ldots, x_{kn}]$, $x_k \in \mathbb{R}^n$ [25,27]. Equation (1) gives the criterion that is minimized when the data set is divided into clusters. The distance measure is calculated with the Euclidean distance criterion given in Equation (2) [27,28].
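The numbered equations referred to in this section are not reproduced in this text. As a point of reference only, the textbook form of k-means is sketched below. Here $x_k$ denotes the $k$-th feature vector with $n$ components, while $c$ (the number of clusters), $C_j$ (the $j$-th cluster), and $\mu_j$ (its center) are notation assumed for this sketch and may differ from the authors' own Equations (1) to (3).

```latex
% Objective: total squared distance of every feature vector to its cluster center
J = \sum_{j=1}^{c} \sum_{x_k \in C_j} d(x_k, \mu_j)^2

% Euclidean distance between a feature vector and a cluster center
d(x_k, \mu_j) = \sqrt{\sum_{i=1}^{n} \left( x_{ki} - \mu_{ji} \right)^2}

% Cluster center: mean of the feature vectors assigned to the cluster
\mu_j = \frac{1}{|C_j|} \sum_{x_k \in C_j} x_k
```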
The term in Equation (1) is defined as given in Equation (3).
A key advantage of Apache Spark for k-Means is that its machine learning library (MLlib) and its Spark Streaming library are built on the same core architecture for distributed analytics. This makes it easier to add extensions that combine components in novel ways [29]. Apache Spark is a general-purpose cluster computing platform designed to be fast. On the speed side, Spark extends the MapReduce model to support more types of computation, such as interactive queries and stream processing. Speed is very important when running k-means on large datasets, as it means the difference between exploring data interactively and waiting minutes or hours. One of the main features Spark offers for speed is the ability to run computations in memory [30].

Data processing
Each point has x and y values. We define random centers among these points. In subsequent iterations, new center points are determined according to the distances of the points. The structure obtained in the last stage is our model, and incoming data is then interpreted according to this model. We need a dataset for our study, and we used the KDD Cup 1999 Data for this purpose [31]. Each line of the data shows the details of one data exchange on the network. The KDD Cup names file at the given URL lists the column to which each value in a line corresponds. For example:
0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1
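The iteration described at the start of this section (random initial centers, followed by repeated reassignment of points and recalculation of centers) can be sketched in plain Java as below. This is a single-machine illustration under stated assumptions, not the Spark MLlib implementation used in the study: the class name KMeansSketch, the seed, and the sample points are hypothetical, and a cluster that becomes empty simply keeps its previous center.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative k-means (Lloyd's iteration); the class name is hypothetical. */
class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the center closest to the point. */
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int j = 1; j < centers.length; j++) {
            if (squaredDistance(point, centers[j]) < squaredDistance(point, centers[best])) {
                best = j;
            }
        }
        return best;
    }

    /** Picks k data points at random as initial centers (duplicates are possible). */
    static double[][] randomCenters(double[][] points, int k, long seed) {
        Random random = new Random(seed);
        double[][] centers = new double[k][];
        for (int j = 0; j < k; j++) {
            centers[j] = points[random.nextInt(points.length)].clone();
        }
        return centers;
    }

    /** Repeats assignment and update until membership stops changing. */
    static double[][] fit(double[][] points, double[][] initialCenters, int maxIterations) {
        int k = initialCenters.length;
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int j = 0; j < k; j++) {
            centers[j] = initialCenters[j].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its nearest center.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int j = nearest(points[i], centers);
                if (j != assignment[i]) {
                    assignment[i] = j;
                    changed = true;
                }
            }
            if (!changed) {
                break;
            }
            // Update step: move each center to the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            for (int j = 0; j < k; j++) {
                if (counts[j] > 0) { // an empty cluster keeps its old center
                    for (int d = 0; d < dim; d++) {
                        centers[j][d] = sums[j][d] / counts[j];
                    }
                }
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9}};
        double[][] centers = fit(points, randomCenters(points, 2, 42L), 20);
        System.out.println(Arrays.deepToString(centers));
    }
}
```

Because the initial centers are random, different seeds can lead to different final centers; passing explicit initial centers to fit makes a run reproducible.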

Coding
The program code used in this article consists of 5 steps.
Step 2: Create a Maven project in Eclipse.
Step 3: After creating the project, the first task is to configure the pom.xml file, through which the dependency libraries used in the project are loaded [32]. With spark-core_2.10, spark-sql_2.10, and spark-mllib_2.10 we have added Spark's libraries, including its machine learning library. bettermonads is used to translate JSON objects into Java classes.
Step 4: Create the class Kdd.java. Our goal is to convert every line in the dataset into a Java object [33].
Step 5: Make the following changes to the main class, App.java. The comments in the code describe what each section does [34].
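The conversion of a dataset line into a Java object, as described in Step 4, might look roughly like the sketch below. It is not the authors' Kdd.java: the class name KddLineSketch and its field names are hypothetical, and it assumes that the second, third, and fourth columns (such as tcp, http, SF) are categorical, that a trailing non-numeric field is a label, and that every other field is numeric.

```java
/** Hypothetical record type for one comma-separated KDD Cup 1999 line. */
class KddLineSketch {
    final String protocol;   // assumed second column, e.g. tcp
    final String service;    // assumed third column, e.g. http
    final String flag;       // assumed fourth column, e.g. SF
    final double[] numeric;  // all remaining numeric columns, in order
    final String label;      // trailing non-numeric field, or null if absent

    KddLineSketch(String protocol, String service, String flag,
                  double[] numeric, String label) {
        this.protocol = protocol;
        this.service = service;
        this.flag = flag;
        this.numeric = numeric;
        this.label = label;
    }

    static boolean isNumber(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** Splits a line on commas and separates categorical, numeric, and label fields. */
    static KddLineSketch parse(String line) {
        String[] parts = line.trim().split("\\s*,\\s*");
        if (parts.length < 5) {
            throw new IllegalArgumentException("Too few fields: " + line);
        }
        int end = parts.length;
        String label = null;
        if (!isNumber(parts[end - 1])) { // treat a non-numeric last field as the label
            label = parts[end - 1];
            end--;
        }
        double[] numeric = new double[end - 3];
        int n = 0;
        for (int i = 0; i < end; i++) {
            if (i >= 1 && i <= 3) {
                continue; // skip the three categorical columns
            }
            numeric[n++] = Double.parseDouble(parts[i]);
        }
        return new KddLineSketch(parts[1], parts[2], parts[3], numeric, label);
    }

    public static void main(String[] args) {
        KddLineSketch record = parse(
            "0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,"
            + "8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1");
        System.out.println(record.flag + " " + record.numeric.length);
    }
}
```

Keeping only the numeric columns in a double array makes the record directly usable as a feature vector for clustering; how the categorical columns are handled in the original code is not stated in the text.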

Experimental Results
When the program code is run, the following 10 results are displayed on the screen. The anomaly values found in the KDD Cup 1999 data are shown below.
0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,27.0,1.0,0.0,1.0,0.22,0.0,0.04
When we examine the dataset, we see that the first value after SF is usually below 1000, whereas in the records detected as anomalies this value is in the range of 50,000. This indicates that something unusual is happening within the network. With the k-Means algorithm, it is possible to evaluate incoming data against a model and to take the necessary actions.

Discussion and Conclusion
To date, machine learning techniques have been used in many fields [35][36][37][38][39][40][41][42][43][44][45][46]. Cyber-attacks have increased steadily with the advancement of Internet technologies. These attacks use networks to gain unauthorized access to sensitive information. It is very important to intervene before an attack takes place. Big data refers to a collection of data that includes diverse types of datasets [1,5]. Network-based intrusion detection is required to detect unusual behavior by network users, and making this determination requires big data analysis [1]. A threat or attack is an attempt by unauthorized people to access or change information in the system. Such attempts produce anomalous behavior [1,6]. Fraud detection, attack detection, and prevention of data leakage are anomaly detection approaches. Continuous changes and the high computational volume of network data have made it more difficult to analyze the data and detect abnormal behavior within it. For this reason, big data solutions have gained importance. With the advancement of internet technologies and the digital age, cyber-attacks have increased steadily. The k-Means clustering algorithm is one of the most widely used algorithms in data mining. Clustering algorithms automatically divide data into smaller clusters or sub-clusters, placing statistically similar records in the same group.
Our study has been compared with similar studies in the literature. The results of these studies are summarized below.
One study [47] proposed a parallel version of K-means implemented on Hadoop MapReduce. Compared with Hadoop, Spark is more suitable for parallelizing iterative algorithms such as K-means. Its distributed memory abstraction, called resilient distributed datasets (RDDs), can cache both intermediate data and input data in memory [48,49]. One paper [49] discussed how to parallelize K-means-based algorithms on Spark. According to this paper, K-means-based clustering algorithms comprise two iterative procedures: centroid updating and distance computation. The paper discussed the technical details of these two phases, gave implementation details for parallelizing K-means-based clustering on Spark, and illustrated the alternative strategies and technical barriers for each step. Experiments on text datasets and large-scale UCI datasets demonstrated the effectiveness of the algorithms [49]. Another paper [50] presented a new K-means-based algorithm implemented on Spark. This algorithm automates the choice of the number of clusters, which otherwise must be supplied in advance and is the major drawback of the classical K-means algorithm. The proposed algorithm also tackles the resolution problem. Experimental results showed that it works efficiently on large-scale datasets and outperforms the K-means algorithm implemented in the Spark Machine Learning Library. Moreover, the algorithm scaled gracefully as more machines were added to the cluster and as the data size increased. Another article [51] designed an intelligent k-means based on Spark and ran it in a Hadoop environment. The algorithm was designed to use batches of data and was compared with a version that does not use batches. The experimental results suggested that the design can speed up computation in big data problems. In addition, the authors reported that, on synthetic data, the design achieves a higher silhouette value than the original k-means.
In this article, we used the k-Means method from the Machine Learning libraries on Spark to determine whether incoming network values represent normal behavior. We used 400 thousand network records obtained from the KDD Cup 1999 Data. With the k-means method, we detected 10 abnormal behaviors among these 400 thousand records.
In future work, anomaly detection on Spark will be performed on the KDD Cup 1999 network data using different methods.