Detection of Different Windows PE Malware Using Machine Learning Methods

Graphical Abstract



Figure. A general summary of obtaining the AyEs dataset

Aim
Detection of malware attacks on Windows systems using machine learning methods.

Design & Methodology
In the study, a testbed was prepared using virtualization technologies such as VMware Workstation. Malware attacks targeting vulnerabilities of the Windows system were prepared with the msfvenom and meterpreter tools and then carried out. The Weka tool was used to examine the effects of the attacks and to detect them. Machine learning methods such as Naive Bayes, J48, BayesNet, IBk, AdaBoost and LogitBoost were used to detect the malware attacks.

Originality
Six different malware attacks were prepared and implemented specifically for Windows systems. Two different datasets were created from the collected data. For the analysis of these datasets, models were proposed for two different detection tasks: determining whether there is an attack, and determining the attack type.

Findings
Our study achieved 98.45% accuracy for the "Normal State-Attacked State" dataset with the J48 algorithm. For the "Attacked State" dataset, the best classification result was a success rate of 90.46%, obtained with the IBk algorithm.

Conclusion
In our study, contributions are made to the literature by preparing a testbed, obtaining a two-stage dataset, performing two different attack detection processes and providing high performance in attack detection.

Declaration of Ethical Standards
The authors of this article declare that the materials and methods used in this study do not require ethical committee permission and/or legal-special permission.

INTRODUCTION
The development and widespread use of the Internet have facilitated and accelerated many tasks in the information technology world. Unfortunately, these developments also bring large-scale security problems. As the importance of information, data and processes increases, so do possible attacks and the damage they can cause. Cyber attacks can cause serious financial damage, leaks of confidential information or loss of trust. For these reasons, providing cyber security has become a necessity. An increasing number of cyber attacks are being carried out against the systems where data is stored or processed, the users of these systems, or the data transmission paths. Attacks are evolving and diversifying, and their effects are growing. One class of such attacks is malware (malicious software) attacks, which include different types such as ransomware, trojans and viruses. Malware is software created specifically to damage systems; it aims to disrupt system operations and steal sensitive and confidential information. It is a piece of code that can be added to, removed from or changed in software [1,2]. Malware attacks target important information such as personal information, bank account information or e-mail account information, which attackers can steal, modify or delete. Malware can infiltrate a system by taking advantage of security vulnerabilities in the network, causing significant damage, especially to institutions and organizations [3]. Therefore, protection from malware attacks is an important part of providing cyber security. To protect against an attack, the attack must first be detected and identified. Attack detection is taken seriously in the cyber security world, and many studies are being conducted on this subject.
In particular, machine learning, deep learning and other artificial intelligence methods are used for malware detection. This study aims to use machine learning to provide cyber security and to detect malware attacks. A testbed was created for this purpose. The focus is on cyber security vulnerabilities in the local network and how to ensure security there. Existing Windows 10 security vulnerabilities were used to seize or damage a Windows 10 computer. Portable Executable (PE) files, which are frequently used in systems, can offer convenient ways to implement security threats. Therefore, the PE file format was chosen in the study and malware attacks were carried out with this file type. In the testbed, malware attacks were prepared specifically for the victim system and carried out. The effects of the attacks were analyzed in detail one by one. The results showed that the victim system was compromised and the attacks were successful. In order to detect the attacks, a new dataset was created by combining the network data that includes the attack types with the network data that contains no attack. In addition, the dataset containing the attack types was also evaluated on its own. Machine learning methods were applied to these datasets and classification was carried out to detect attacks. The literature contains many studies on malware detection. For example, Huang et al. conducted a study using a deep learning method to detect malware for the Windows 7 operating system. They used a ready-made dataset to test the proposed malware detection method and were able to detect 94.70% of attacks [4]. Upadhayay et al. combined three different datasets in their study: the Genome, Drebin and Koodous datasets. In their datasets, they labeled the permissions given in network traffic as normal or malware.
Afterwards, three different detection methods were applied to this dataset: static, dynamic and hybrid detection. In addition, machine learning algorithms were applied with each method. The highest accuracy, 95.96%, was obtained with the Support Vector Machine (SVM) algorithm applied in hybrid detection [5]. Krcal et al. used a machine learning method to detect malicious PE files for Windows, working with a PE file dataset provided by Avast. A feedforward network was used, and a detection success of 96.0% was obtained [6]. Diaz et al. used the Sophos-ReversingLabs 20 Million Dataset for non-signature-based malware detection for Windows operating systems. A combination of Long Short-Term Memory (LSTM) and LightGBM was used for classification, and an accuracy of 91.73% was obtained [7]. Mohan et al. aimed to detect malicious software for Windows. The dataset used in the study was created with the Dtrace tool, and feature selection was applied to it. On this dataset, the Decision Tree (DT) and Random Forest (RF) algorithms gave the best accuracy, 97% [8]. Irshad et al. created a dataset to detect malware in Windows. After feature extraction, three different algorithms were used for malware detection. Among these, the algorithm with the highest accuracy was RF, with 86.8% [9]. Anderson et al. created a dataset of malicious and benign Windows PE files themselves and named it EMBER. In their study, the MalConv and LightGBM methods were compared, and higher accuracy was achieved with LightGBM [10]. In this study, malware attacks against Windows systems were detected using machine learning methods. The study has five main sections.
The first section contains the introduction and information about related studies. The second section describes the testbed created and used in the study and the preparation of the attacks. The third section explains the execution of the prepared attacks and how the dataset was obtained afterwards. The fourth section covers the detection of the attacks against the testbed, the analysis results and the discussion. The last section presents the results of the study and their interpretation.

TESTBED
A testbed was prepared for carrying out attacks, monitoring their effects, observing the state of the victim system and detecting the attacks, and simulations were carried out on it. Different tools and machines were used to build the testbed. The VMware Workstation virtualization platform was used and virtual machines were installed on it. They were configured to communicate over the local network. A virtual machine running the Windows 10 operating system was used as the victim system, the target of the attacks. This machine was allocated 4 GB of RAM and 30 GB of hard disk space. A virtual machine running the Kali Linux operating system was used as the attacker system. This machine was allocated 4 GB of RAM and 20 GB of hard disk space. Before an attack is made, the characteristics of the victim system must be obtained. The IP address of the victim system was obtained by scanning the local network. Port scanning was also performed to find the open ports used by the victim system. After the IP address and port information were determined, payloads were generated using the Metasploit Framework on the attacker machine. Payloads with the required properties are produced with the msfvenom tool. Encoding is applied so that the generated payloads can pass security measures without being caught. The msfvenom tool combines the functions of the earlier msfpayload and msfencode tools.
In this study, a PE-type payload was generated with the msfvenom tool and Meterpreter. The payload produced is an executable ".exe" file. Meterpreter is short for Meta-Interpreter and is an advanced payload that is part of the Metasploit Framework. There are several reasons for using the Meterpreter payload. Meterpreter operates in RAM and does not write to the hard disk, so it leaves as few traces as possible on the victim system. It can be extended with various modules without the need for recompilation. In addition, it is quite powerful because it communicates over multiple channels. It also offers conveniences such as command history and command completion. "windows/meterpreter/reverse_tcp" is the payload used to gain access to the target system by exploiting security vulnerabilities. With this payload, Remote File Inclusion security vulnerabilities are exploited. As shown in Figure 1, identifying an exploit that targets the vulnerabilities of the victim system is the first step in preparing an attack. Then the payload is produced in accordance with the exploit and, finally, the exploit is performed. A stager payload is run on the victim system for the attack. In our study, reverse_tcp was chosen as the stager. The selected stager loads a Dynamic Link Library (DLL), which is injected into another process. In the last stage, an encrypted message is sent from the target system to Metasploit on the attacker system, and communication continues over encrypted traffic.

ATTACKING AND OBTAINING THE DATASETS
After the payload is created, the necessary settings are made and the target system is accessed. Then, the payload is activated with the "run" command and the target system is compromised. As seen in Figure 2, unauthorized operations were carried out on the victim system, using the "screenshot" attack as an example. In the study, six different Meterpreter attacks were carried out with the aim of damaging the victim system. The attack types are:
• Screenshot: Takes a snapshot of the victim system's screen.
• Record_mic: Records the ambient sound of the victim system for a specified time.
• Vnc: Allows the screen activity of the victim system to be monitored for a certain period of time.
• Getuid: Reveals the user name of the victim system.
• Dir: Lists the current file directory on the compromised victim system.
• Sysinfo: Provides victim system information (operating system, number of users, etc.).
The Wireshark program was used to capture the network packets of the system on which the Meterpreter attacks were applied, separately for each attack. Wireshark provides monitoring, analysis and optional filtering of network traffic via a graphical interface [11]. It is one of the most frequently used tools in the literature. Network flow data records certain parts of the information in each packet so that it can be analyzed using special algorithms. The features suitable for this study were determined by examining the KDDCUP99 dataset [12]. Six features were selected for the captured network flow data. In addition to these six columns, there is one class column. This information is shown in Table 1. The dataset contains the port and IP information of the source machine that sent the packet and of the destination machine that received it. In addition, it contains the communication protocol used to transmit the packet and the size of the transmitted packet. The Meterpreter Type column is used for classification. This column indicates the attack type and plays an important role in the analysis.
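As an illustration of the record layout described above (six features plus one class column), the following is a minimal sketch rather than the authors' actual tooling. The CSV column names (`Source`, `SrcPort`, `Destination`, `DstPort`, `Protocol`, `Length`), the sample IP addresses and the helper `load_records` are assumptions made for this example; Table 1 remains the authoritative feature list.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One captured packet: six features plus the class label."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    length: int
    meterpreter_type: str  # class column, e.g. "screenshot" or "normal"


def load_records(csv_text: str, label: str) -> list[FlowRecord]:
    """Parse a packet export (hypothetical column names) and attach one label.

    Each attack was captured separately, so a whole capture shares one label.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        FlowRecord(
            src_ip=r["Source"],
            src_port=int(r["SrcPort"]),
            dst_ip=r["Destination"],
            dst_port=int(r["DstPort"]),
            protocol=r["Protocol"],
            length=int(r["Length"]),
            meterpreter_type=label,
        )
        for r in rows
    ]


# Made-up sample rows, for illustration only.
SAMPLE = """No,Time,Source,SrcPort,Destination,DstPort,Protocol,Length
1,0.000,192.168.1.10,4444,192.168.1.20,49712,TCP,1514
2,0.002,192.168.1.20,49712,192.168.1.10,4444,TCP,54
"""

records = load_records(SAMPLE, label="screenshot")
```

Labeling per capture file mirrors the statement that packets were collected separately for each attack example.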
The data acquisition part of the AyEs dataset prepared in this study is two-stage. Figure 3 shows the process of obtaining the AyEs dataset.

Figure 3. Obtaining the AyEs dataset
According to Figure 3, the stages are as follows:
Stage 1:
• Normal operating state data of the victim system were recorded before the attack.
• Total attack state data were obtained by performing six different attacks separately.
By combining these two datasets, the "normal state-attacked state" dataset was created. The normal state contains 26856 sample rows and the attacked state contains 77184 sample rows, for a total of 104040 samples.
Stage 2:
• "Attacked state" data were obtained by performing six different attacks separately.
This "attacked state" dataset consists of 77184 sample rows. Information about the attacks used in Stage 1 and Stage 2 is given in Table 2. Table 2 shows the total number of samples obtained per attack type. When the sample rows consisting of network packets are examined, the largest single class is the normal state with no attack, at 26% of the samples. Among the attack types, the "dir" attack has the most packet samples, with a rate of 31% relative to all attacks. The "getuid" attack has the fewest samples, with a rate of 12%.
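The composition of the Stage 1 dataset can be checked with a few lines of arithmetic. The sketch below uses only the row counts quoted above; the variable names are illustrative.

```python
# Row counts quoted in the text for the Stage 1 dataset.
NORMAL_ROWS = 26856
ATTACK_ROWS = 77184

total_rows = NORMAL_ROWS + ATTACK_ROWS
normal_share = NORMAL_ROWS / total_rows
attack_share = ATTACK_ROWS / total_rows

print(f"total rows:   {total_rows}")
print(f"normal share: {normal_share:.1%}")
print(f"attack share: {attack_share:.1%}")
```

With these counts the normal state is about 25.8% of the rows, which matches the rounded 26% above. It also means a classifier that always answers "attacked" would already be right on about 74.2% of the rows, a useful baseline when reading the Stage 1 accuracy figures.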

ATTACK DETECTION AND DISCUSSION
Transforming large amounts of data into meaningful information through various analyses is necessary for intrusion detection systems. Several methods achieve this. One of them, data mining, was used in the study. The Weka tool was used to process the obtained datasets, evaluate the data statistically, and draw meaningful conclusions from the patterns. Weka is a useful tool for analysis operations such as data classification, clustering and regression [13,14]. It is used to test the performance of many algorithms in cyber attack detection studies.
In this study, the datasets were converted to ".csv" format for use in the Weka program. The datasets were first preprocessed: noisy or null data were removed, and columns that had no effect on the analysis, such as Time and No, were deleted. Thus, the data were made ready for processing. The steps of the attack detection process performed for both datasets are given in Figures 4 and 5.
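The preprocessing described above can be sketched with the Python standard library. This is an assumed reconstruction, not the authors' procedure: the column names other than `No` and `Time`, the sample rows, and the rule "drop any row with an empty field" are illustrative choices, and no noise filtering beyond missing values is attempted.

```python
import csv
import io

DROP_COLUMNS = {"No", "Time"}  # columns the text says were deleted


def preprocess(csv_text: str) -> list[dict[str, str]]:
    """Drop the unused columns and any row that has a missing value."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        kept = {k: v for k, v in row.items() if k not in DROP_COLUMNS}
        # Keep the row only if every remaining field is a non-empty string.
        if all(isinstance(v, str) and v.strip() for v in kept.values()):
            cleaned.append(kept)
    return cleaned


# Made-up rows for illustration; the second row has an empty Destination.
RAW = """No,Time,Source,Destination,Protocol,Length,MeterpreterType
1,0.000,192.168.1.10,192.168.1.20,TCP,1514,dir
2,0.004,192.168.1.20,,TCP,54,dir
3,0.009,192.168.1.20,192.168.1.10,TCP,60,dir
"""

rows = preprocess(RAW)
```

The cleaned rows could then be written back with `csv.DictWriter` to produce the ".csv" file that is loaded into Weka.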

Figure 4. Attack detection stages for 'Normal State-Attacked State' dataset
According to Figure 4 and Figure 5, two different test methods were used while creating the model for attack detection with data mining. These are:
• Percentage Split: The data is split in fixed proportions. 66% of the data was used for training the model, and the remaining data was used for testing.
• Cross Validation: This method is also known as "k-fold cross validation". The dataset is divided into k equal parts and testing is performed for k different sets. A k value of 10 was chosen in the study.
As shown in Figures 4 and 5, six different data mining algorithms were used to perform attack detection by classification. The obtained results were compared with each other and presented in tables. These algorithms are:
Bayesian Networks: These algorithms are used to categorize or classify, and probability results are used in the classification process. The Naive Bayes model uses a methodology that can achieve high-accuracy results [15]. BayesNet is a successful algorithm for making decisions in uncertain situations. In classification, the data presented for training must have a class label. Probabilities are computed on the training data, and the values obtained at this stage are used to classify the test data. The formulation of Bayes' theorem is shown in Figure 6. In the equation given in Figure 6, X denotes the class and Y represents the features in the test data. The terms of the equation are interpreted as follows:
• P(X): The ratio of the number of samples of class X in the training set to the total number of samples.
• P(Y): The ratio of the number of samples with feature Y in the training set to the total number of samples.
• P(Y|X): The probability that a sample in class X has feature Y.
• P(X|Y): The probability that a sample with feature Y is from class X.
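Bayes' theorem relates the four quantities listed above as P(X|Y) = P(Y|X) · P(X) / P(Y). The following is a minimal sketch of how each term can be estimated from counts, using a single categorical feature and a made-up toy training set. It is not Weka's NaiveBayes or BayesNet implementation, which handle many features, numeric attributes and smoothing.

```python
from collections import Counter

# Toy training set: (protocol feature Y, class X). Values are illustrative only.
TRAIN = [
    ("TCP", "attack"), ("TCP", "attack"), ("TCP", "attack"),
    ("UDP", "attack"),
    ("TCP", "normal"),
    ("UDP", "normal"), ("UDP", "normal"), ("UDP", "normal"),
]


def posterior(feature: str, label: str) -> float:
    """P(X|Y) = P(Y|X) * P(X) / P(Y), with each term estimated from counts."""
    n = len(TRAIN)
    class_counts = Counter(x for _, x in TRAIN)
    feature_counts = Counter(y for y, _ in TRAIN)
    joint_counts = Counter(TRAIN)

    p_x = class_counts[label] / n          # P(X): share of samples in class X
    p_y = feature_counts[feature] / n      # P(Y): share of samples with feature Y
    p_y_given_x = joint_counts[(feature, label)] / class_counts[label]  # P(Y|X)
    return p_y_given_x * p_x / p_y


p_attack = posterior("TCP", "attack")
p_normal = posterior("TCP", "normal")
```

In this toy data a TCP sample is assigned to the "attack" class, since its posterior (0.75) exceeds that of "normal" (0.25); the two posteriors sum to 1 because the classes are exhaustive.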

J48:
It is a decision tree algorithm that uses the information gain ratio as its feature selection criterion [16].
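J48 is Weka's implementation of the C4.5 decision tree, and the gain ratio it uses divides the information gain of a split by the entropy of the split itself. The sketch below computes it for categorical features only, with made-up data and hypothetical function names; a real C4.5 implementation additionally handles numeric attributes, missing values and pruning.

```python
import math
from collections import Counter


def entropy(values) -> float:
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature_values, labels) -> float:
    """Information gain of splitting on the feature, divided by split entropy."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy after the split, weighted by group size.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Illustrative data only.
labels = ["attack", "attack", "normal", "normal"]
protocol = ["TCP", "TCP", "UDP", "UDP"]          # separates the classes perfectly
size_bin = ["small", "large", "small", "large"]  # carries no class information

ratio_protocol = gain_ratio(protocol, labels)
ratio_size = gain_ratio(size_bin, labels)
```

Here the protocol feature would be chosen for the root condition because its gain ratio (1.0) is higher than that of the size feature (0.0).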

Figure 7. The working principle of the decision tree algorithm
Decision tree algorithms divide the dataset sequentially [17]. The first condition is determined from the features that are most effective in making the classification. The initial condition is called the root, the sub-conditions are nodes, and the last layer, where the classification is made, consists of leaves. The working principle of the decision tree algorithm is given in Figure 7.
IBk: A K-Nearest Neighbor algorithm that uses the same distance metric. The number of nearest neighbors is a decisive factor for classification. K is the number of nearest neighbors taken into account. Whichever label is most common among the K nearest neighbors is selected as the classification result [18]. This is illustrated in Figure 8. Different search algorithms can be used to speed up finding the nearest neighbors.
AdaBoost and LogitBoost: These are boosting algorithms. In each iteration, the value of an incorrectly estimated label is multiplied by a coefficient. Thus, the weights of the correctly and incorrectly estimated samples are equalized. In the second iteration, the same operations are applied again, and the iterations continue until the correct predicted value is reached. The working logic of the iterations is shown in Figure 9. The boosting algorithm thus aims to combine a large number of weak learners in a harmonious order [19].
The algorithms to be used in classification may vary according to the definition of the problem. Depending on the purpose, performance can be measured by accuracy, computation time or other success metrics [20]. The success metrics in the literature are used here to evaluate the classification models, and the analysis results were compared according to these metrics. TP (True Positive) is the case where the actual value of the sample to be classified is 1 and the classification result is 1. When the actual sample value is 0 and the classification result is 0, it is called TN (True Negative). FP (False Positive) is the case where the classification result of a sample with an actual value of 0 is 1. Finally, if the classification result of a sample whose actual value is 1 is 0, it is called FN (False Negative). The success metrics used are:
• Accuracy: Indicates how accurately the classification model predicts all values of the dataset.
• Precision: The ratio of correctly predicted positive cases (TP) to the sum of TP and FP. A higher ratio indicates better precision of the classification model [21].
• Recall: The ratio of TP to the sum of TP and FN.
• F-Measure: The harmonic mean of the precision and recall metrics.
• TP Rate: The ratio of TP to the number of cases whose actual value is 1.
• FP Rate: The ratio of FP to the number of cases whose actual value is 0 [22].
• ROC Area: Shows how well normal and attack states are separated in our study. The closer its value is to 1, the higher the performance of the classification method.
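The count-based metrics above follow directly from the four confusion-matrix cells. The sketch below uses hypothetical counts that are not taken from the study, and the class name `ConfusionCounts` is an assumption for this example. ROC Area is omitted because it needs per-sample scores rather than counts, and zero denominators are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # actual 1, predicted 1
    tn: int  # actual 0, predicted 0
    fp: int  # actual 0, predicted 1
    fn: int  # actual 1, predicted 0

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn)

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        # Same quantity as the TP rate: TP over all actual positives.
        return self.tp / (self.tp + self.fn)

    def fp_rate(self) -> float:
        return self.fp / (self.fp + self.tn)

    def f_measure(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r)


# Hypothetical counts for illustration only.
counts = ConfusionCounts(tp=90, tn=80, fp=20, fn=10)
```

For these counts, `counts.accuracy()` is 0.85, `counts.recall()` is 0.9 and `counts.fp_rate()` is 0.2. For the multi-class "Attacked State" dataset, such per-class values would typically be averaged across classes.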

Data Analysis for Normal State-Attacked State
The normal state data recorded without an attack and the attacked state data obtained after applying six attack types were combined. Using this dataset, it was analyzed whether there is an attack on the Windows system. The results obtained are given in Table 3. According to Table 3, six different algorithms were used to classify attack and normal states, and two test methods, Cross-Validation and Percentage Split, were applied for each. The algorithm with the highest classification success rate (accuracy) is J48 with the Cross-Validation method. The J48 algorithm gave the best results in determining whether there was an attack in the obtained dataset, although the time it spent on analysis is above the average. The AdaBoost algorithm showed the worst performance in attack detection; its success rate is low for both methods.
The best results obtained for the success metrics of the classification models are given in Table 4. For the Precision, Recall, F-Measure, TP Rate and ROC Area metrics, the highest values were obtained with the J48 algorithm. In terms of the time spent on classification in the Weka program, the IBk algorithm performed best for the first method and worst for the second method. The J48 algorithm ran longer than all other algorithms except IBk. Furthermore, the IBk algorithm produced the second-best classification after J48. For the FP Rate, the BayesNet algorithm gave the best results.
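The two test methods applied in this analysis, a 66% percentage split and 10-fold cross-validation, differ only in how sample indices are partitioned into training and test sets. The following sketch shows that partitioning under simplifying assumptions: the function names are hypothetical, and shuffling or stratification, which evaluation tools commonly apply before splitting, is left out.

```python
def percentage_split(n_samples: int, train_fraction: float = 0.66):
    """Return (train_indices, test_indices); the first 66% are used for training."""
    cut = round(n_samples * train_fraction)
    indices = list(range(n_samples))
    return indices[:cut], indices[cut:]


def k_fold(n_samples: int, k: int = 10):
    """Yield (train_indices, test_indices) for each of k near-equal folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


train_idx, test_idx = percentage_split(100)
folds = list(k_fold(100, k=10))
```

With cross-validation every sample is used for testing exactly once, whereas the percentage split tests on a single held-out portion; this is one reason the two methods can report different accuracies for the same algorithm.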

Data Analysis for Attacked State
After applying six different attack types separately, the resulting data were combined. Using this dataset, the type of each attack was determined. The results obtained are given in Table 5. According to Table 5, six different algorithms were applied to classify the attacks, and two test methods were applied for each. For both test methods, the IBk algorithm had a higher accuracy than the other algorithms. The time taken for analysis varied between algorithms. The AdaBoost algorithm obtained the lowest accuracy for both methods, with a running time close to the average. When the success metrics in Table 6 were examined, the highest values for Precision, Recall, F-Measure and TP Rate were obtained with the IBk algorithm.
Considering the time the algorithms take for analysis, the algorithms that finish classification in the shortest time are IBk and Naive Bayes, while the longest analysis is the Percentage-Split run of the IBk algorithm. BayesNet gave the best result for FP Rate and J48 for ROC Area. In addition, the J48 algorithm produced the second-best classification after IBk.

Discussion
The earlier studies examined in the introduction are listed in Table 7 and compared with our work in this section. The analysis results are compared using the Accuracy metric, which is frequently used in the literature. According to Table 7, different methods have been used to detect malware attacks on Windows systems. Unlike the studies examined, our study used six different algorithms together with two different test methods. In the attack detection analysis, the highest success rates were obtained with the IBk and J48 algorithms. Our study has two stages of attack detection. In the first stage, the J48 algorithm achieved a higher attack detection success rate than the studies in the literature. In the second stage, the IBk algorithm gave the highest success in detecting the attack type, but the rate was lower than those in the literature. Our study contributes to the literature by preparing a testbed, obtaining a two-stage dataset, and providing high performance in malware detection.

CONCLUSION
In information technologies, malware attacks on Windows systems can cause serious problems. In order to prevent possible damage, it has become necessary to provide cyber security for Windows systems. For this, first of all, it is necessary to determine the attack types and to detect the attack accordingly. There are many studies carried out for this purpose and using different methods. In order to contribute to the literature in this field and to give a different perspective, a study on intrusion detection has been carried out.
In addition to the studies in the literature, a special testbed was prepared and named AyEs.

DECLARATION OF ETHICAL STANDARDS
The authors of this article declare that the materials and methods used in their studies do not require ethical committee approval and legal-specific permission.