Institutional data sources, social media posts, website articles,
and forms generate large amounts of data. Processing such volumes with
traditional methods and turning them into information that supports
decision-making is very difficult.
In this context, data mining, with the advanced
techniques it offers, can extract the needed information from the available
data.
Databases are rich in hidden information that
can support rational decision-making. Classification and prediction are two
important data analysis techniques used for describing important data classes
or predicting future data trends. These analyses help in better
understanding large amounts of data. Today, institutions produce large
amounts of data but have difficulty revealing the meaningful and useful
information within it. Large data sets are not easy to analyze with traditional
statistical methods, so special methods are required to process and
analyze them. Data mining methods have emerged to meet this requirement.
The aim of this study is to compare the performance
of the SMO and J48 algorithms, which are used for classification in data
mining. For this purpose, data mining was performed on three different student
data sets.
Data mining is an analysis method that summarizes data
in novel ways that are both useful and understandable, and that exposes hidden
relationships. It is one of the steps of knowledge discovery in
databases, a process first applied to scientific and technical data to reveal
unknown patterns. Classification is a process that is frequently used in daily
life. In classification, objects are separated so that each one is assigned to
one of a set of mutually exclusive and exhaustive categories, called classes. Many
practical decision-making processes can be formulated as a classification
problem. For example, people or objects can be assigned to one of many categories.
Classification is the process of assigning different elements to different
classes. These classes may be defined by business rules, class boundaries, or
mathematical functions. A classifier can be built from the
relationship between the properties of example elements and their known class
values. This type of classification is called “supervised learning”. If
there are no known examples of a class, the classification is unsupervised.
The most common unsupervised classification approach is clustering. The most
common applications of clustering techniques are retail basket analysis and
fraud detection.
The idea of supervised learning in data mining is
to learn a classification function, or to construct a classification model,
from data whose classes are already known. This function or model maps records
from the database to a target attribute, so it can be used to predict the class
of new data. Data mining systems relate to areas such as spatial data
analysis, information retrieval, pattern recognition, image analysis, signal
processing, computer graphics, web technology, economics, business,
bioinformatics, or psychology, depending on the types of data to be mined or
the specific data mining application.
SMO (Sequential Minimal Optimization) is a simple
algorithm that can quickly solve the support vector machine (SVM) quadratic
programming (QP) problem without any extra matrix storage and without using
numerical QP optimization steps. At every step, SMO solves
the smallest possible optimization problem. For the standard SVM QP problem,
the smallest possible optimization problem involves two Lagrange
multipliers, because the Lagrange multipliers must obey a linear equality
constraint. At each step, SMO selects two Lagrange multipliers to optimize
jointly, finds the optimal values for these multipliers, and
updates the SVM to reflect the new optimal values. The advantage of SMO lies in
the fact that solving for two Lagrange multipliers can be done
analytically. Thus, numerical QP optimization is avoided entirely. Although
more optimization sub-problems are solved over the course of the algorithm, each
sub-problem is so fast that the overall QP problem is solved quickly.
Furthermore, SMO does not require any additional matrix storage. Therefore, very
large SVM training problems can fit into the memory of an ordinary personal
computer or workstation. SMO is also less susceptible to numerical precision
problems because it uses no matrix algorithms.
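
The analytic two-multiplier step described above can be illustrated with a minimal Python sketch. It follows the standard textbook form of the SMO pair update and is not taken from the study or from WEKA's implementation. The function name, argument names, and example values are assumptions made for illustration, and a full SMO solver would also need pair-selection heuristics and a threshold update, which are omitted here.

```python
def smo_pair_update(a1, a2, y1, y2, e1, e2, k11, k12, k22, c):
    """Analytically optimize one pair of Lagrange multipliers (illustrative).

    a1, a2: current multipliers; y1, y2: class labels (+1 or -1);
    e1, e2: prediction errors f(x_i) - y_i; k11, k12, k22: kernel values;
    c: upper bound on every multiplier (box constraint).
    Returns the updated pair (a1_new, a2_new).
    """
    # Feasible segment for a2 implied by 0 <= a <= c and the equality constraint.
    if y1 != y2:
        low, high = max(0.0, a2 - a1), min(c, c + a2 - a1)
    else:
        low, high = max(0.0, a1 + a2 - c), min(c, a1 + a2)

    # Curvature of the objective along the constraint line.
    eta = k11 + k22 - 2.0 * k12
    if eta <= 0.0 or low >= high:
        return a1, a2  # degenerate pair: left unchanged in this sketch

    # Unconstrained optimum for a2, then clip it to the feasible segment.
    a2_new = a2 + y2 * (e1 - e2) / eta
    a2_new = min(high, max(low, a2_new))

    # Adjust a1 so that y1*a1 + y2*a2 keeps its previous value.
    a1_new = a1 + y1 * y2 * (a2 - a2_new)
    return a1_new, a2_new


# Hypothetical starting point: two examples of opposite class, both multipliers zero.
a1_new, a2_new = smo_pair_update(0.0, 0.0, 1, -1, -1.0, 1.0, 1.0, 0.0, 1.0, c=1.0)
print(a1_new, a2_new)
```

Because the second multiplier is clipped to a segment and the first is adjusted by the opposite signed amount, both stay within their bounds and the linear equality constraint is preserved without any numerical QP routine.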
J48 is a decision tree algorithm based on the very
popular C4.5 algorithm developed by J. Ross Quinlan. Decision trees are a
classic way of representing the knowledge learned by a machine learning
algorithm and provide a powerful and fast way to express structure in data. The
algorithm partitions the data recursively. This maximizes accuracy on the
training data, but it can produce overly specific rules that describe only the
particular characteristics of that data. Based on information
gain, J48 can automatically process the data to select the
relevant attributes. It iteratively splits the samples
at the point where the information gain is highest. Tree construction starts
by selecting the best root attribute and splitting the samples on it,
and then proceeds from top to bottom. J48 can also perform
effective pruning to cut weak branches that are not meaningful. One
reason for pruning is that the purpose of a decision tree is not to explore the
data but to create a simple classification model of it.
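
The split criterion described above can be sketched in a few lines of Python. This is only an illustration of plain information gain as the text describes it. C4.5 itself is usually described as normalizing this quantity into a gain ratio, which is not shown. The function names and the four toy records are assumptions and are not taken from the study's data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting the labels on one attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels):
    """Name of the attribute with the highest information gain."""
    return max(rows[0], key=lambda a: information_gain([r[a] for r in rows], labels))


# Hypothetical toy records (not from the study).
rows = [
    {"gender": "F", "income": "low"},
    {"gender": "M", "income": "low"},
    {"gender": "F", "income": "high"},
    {"gender": "M", "income": "high"},
]
labels = ["low_score", "low_score", "high_score", "high_score"]
print(best_attribute(rows, labels))
```

In this toy case the attribute that separates the two classes perfectly is chosen as the root, and the same selection would then be repeated on each resulting subset to grow the tree from top to bottom.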
In this study, three different data sets of university
students were used. The data were cleaned and arranged using
Excel macros, and data warehouses were prepared. After the necessary
conversions, the data were written to the text files “iibf1.arff”, “iibf2.arff”,
and “myo.arff”. The study used the WEKA program (Waikato Environment for
Knowledge Analysis), version 3.7.2, developed by the University of Waikato.
For each data set, the student's gender, province, family income level,
number of siblings, number of siblings in education, and entrance score were
taken as attributes. The level of the entrance score was used to define the classes.
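
For readers unfamiliar with the ARFF text format that WEKA reads, the following Python sketch shows how such a file might be written. The relation name, attribute names, category values, and rows are all hypothetical placeholders. The actual contents of the study's files are not given in the text, and the authors produced them with Excel macros rather than with code like this.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from attribute declarations and data rows.

    attributes: list of (name, kind) pairs, where kind is either the string
    "numeric" or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            # Nominal attribute: allowed values listed inside braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
        else:
            lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical attributes and rows, for illustration only.
attributes = [
    ("gender", ["F", "M"]),
    ("siblings", "numeric"),
    ("entry_level", ["low", "medium", "high"]),  # assumed class attribute
]
rows = [("F", 2, "medium"), ("M", 3, "high")]
text = to_arff("students", attributes, rows)
print(text)
```

WEKA treats the last attribute as the class by default, which is why the assumed class attribute is declared last in this sketch.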
According to the results, the classification success rate of the
SMO algorithm is higher than that of the J48 algorithm, making
SMO the more reliable algorithm on these data sets.
Purpose: Data mining is an interdisciplinary field that is continuously developing and whose areas of use are becoming more widespread. Through the use of various techniques and algorithms, it helps ensure the reliability of data. Classification is an important data mining technique because it is widely used by researchers.
Method: In this study, the classification results of the SMO and J48 algorithms were compared on three different student data sets. The performance of the J48 and SMO algorithms in terms of classification accuracy was evaluated on the three data sets using various accuracy measures such as TP rate, FP rate, precision, recall, F-measure, and ROC analysis.
Findings and Conclusion: The tests showed that the SMO algorithm had better classification performance on all three data sets.
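
The accuracy measures listed above (TP rate, FP rate, precision, recall, F-measure) can all be derived from a confusion matrix. The Python sketch below shows the standard definitions for a single class treated as positive. The function name and the counts in the example are assumptions for illustration, not results from the study, and the sketch assumes every denominator is non-zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard accuracy measures from the four confusion-matrix counts."""
    tp_rate = tp / (tp + fn)        # also called recall or sensitivity
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": tp_rate,
        "f_measure": f_measure,
    }


# Hypothetical counts: 8 true positives, 1 false positive,
# 2 false negatives, 9 true negatives.
metrics = binary_metrics(tp=8, fp=1, fn=2, tn=9)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a problem with more than two classes, the same calculation can be repeated with each class in turn treated as positive and the remaining classes as negative.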
Primary Language | Turkish |
---|---|
Journal Section | Original Articles |
Authors | |
Publication Date | December 26, 2018 |
Submission Date | November 25, 2018 |
Acceptance Date | December 24, 2018 |
Published in Issue | Year 2018 Volume: 6 Issue: 3 |
This journal is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.