Initial Seed Value Effectiveness on Performances of Data Mining Algorithms

Tunahan Timuçin; İrem Duzdar Argun

doi:10.29130/dubited.813101

Research Article

Year 2021, Volume: 9 Issue: 2, 555 - 567, 25.04.2021

Cited By: 1

Abstract

Since the 2000s, computing capacity and capabilities have increased and access to data has become easier. However, the data that is produced and recorded must be made meaningful. Data mining helps transform raw data into meaningful information. This study examines classification methods, one family of data mining applications. First, the parameters that cause results to differ on the same data set were investigated in four data mining tools (Weka, RapidMiner, KNIME, Orange) and tested with three algorithms (k-nearest neighbors, Naive Bayes, Random Forest). To evaluate performance while building the classification models, the data set was split into training and test data at 80%-20%, 70%-30%, and 60%-40%. Accuracy, ROC, and precision values were used to assess classification performance. During classification, the effect of algorithm parameters on the results was observed. The most important of these parameters is the initial seed value. The initial seed is a value, used especially in classification algorithms, that determines the initial placement of the data and directly affects the result. It is therefore very important to choose the initial seed value appropriately. In this study, initial seed values between 0 and 100 were evaluated, and it was shown that the seed value can change the classification accuracy by approximately 5%.
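The protocol described in the abstract, a seeded train/test split repeated for seed values 0 to 100 at three split ratios, can be illustrated with a minimal Python sketch. This is not the authors' setup: it does not use Weka, RapidMiner, KNIME, or Orange, nor the credit approval data. The synthetic two-class data set, the function names, and the choice of k = 5 are all assumptions made for illustration, and here the seed only controls which rows land in the training and test parts.

```python
import random


def make_dataset(n=300, rng_seed=12345):
    """Build synthetic two-class data (hypothetical stand-in, not the paper's data)."""
    rng = random.Random(rng_seed)
    data = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 1.5  # overlapping classes, so errors occur
        features = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((features, label))
    return data


def split(data, train_fraction, seed):
    """Shuffle a copy with a seeded generator, then cut into train and test parts."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def knn_predict(train, point, k=5):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    def distance(row):
        (x, y), _ = row
        return (x - point[0]) ** 2 + (y - point[1]) ** 2

    nearest = sorted(train, key=distance)[:k]
    votes_for_one = sum(label for _, label in nearest)
    return 1 if votes_for_one * 2 > k else 0


def accuracy(train, test, k=5):
    """Fraction of test rows whose predicted label matches the true label."""
    correct = sum(knn_predict(train, features, k) == label for features, label in test)
    return correct / len(test)


def seed_sweep(data, train_fraction, seeds):
    """Map each seed to the test accuracy obtained with that seed's split."""
    return {seed: accuracy(*split(data, train_fraction, seed)) for seed in seeds}


if __name__ == "__main__":
    data = make_dataset()
    for fraction in (0.8, 0.7, 0.6):
        scores = seed_sweep(data, fraction, range(0, 101))
        low, high = min(scores.values()), max(scores.values())
        print(f"train={fraction:.0%}  min accuracy={low:.3f}  max accuracy={high:.3f}")
```

Running the script prints the lowest and highest accuracy seen across the 101 seeds for each split ratio. The spread on this toy data says nothing about the paper's reported figure of roughly 5%; it only shows how a result can depend on the seed when everything else is held fixed.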

Keywords

Data Mining, Classification, Credit Approval, Seed Value


Turkish title: Veri Madenciliği Algoritmalarının Performanslarında İlk Tohum Değer Etkinliği




There are 37 citations in total.

Details

Primary Language	English
Subjects	Engineering
Journal Section	Articles
Authors	Tunahan Timuçin (ORCID 0000-0003-0332-4118); İrem Duzdar Argun (ORCID 0000-0002-7642-8121)
Publication Date	April 25, 2021
Published in Issue	Year 2021 Volume: 9 Issue: 2

Cite

APA	Timuçin, T., & Duzdar Argun, İ. (2021). Initial Seed Value Effectiveness on Performances of Data Mining Algorithms. Duzce University Journal of Science and Technology, 9(2), 555-567. https://doi.org/10.29130/dubited.813101
AMA	Timuçin T, Duzdar Argun İ. Initial Seed Value Effectiveness on Performances of Data Mining Algorithms. DUBİTED. April 2021;9(2):555-567. doi:10.29130/dubited.813101
Chicago	Timuçin, Tunahan, and İrem Duzdar Argun. “Initial Seed Value Effectiveness on Performances of Data Mining Algorithms”. Duzce University Journal of Science and Technology 9, no. 2 (April 2021): 555-67. https://doi.org/10.29130/dubited.813101.
EndNote	Timuçin T, Duzdar Argun İ (April 1, 2021) Initial Seed Value Effectiveness on Performances of Data Mining Algorithms. Duzce University Journal of Science and Technology 9 2 555–567.
IEEE	T. Timuçin and İ. Duzdar Argun, “Initial Seed Value Effectiveness on Performances of Data Mining Algorithms”, DUBİTED, vol. 9, no. 2, pp. 555–567, 2021, doi: 10.29130/dubited.813101.
ISNAD	Timuçin, Tunahan - Duzdar Argun, İrem. “Initial Seed Value Effectiveness on Performances of Data Mining Algorithms”. Duzce University Journal of Science and Technology 9/2 (April 2021), 555-567. https://doi.org/10.29130/dubited.813101.
JAMA	Timuçin T, Duzdar Argun İ. Initial Seed Value Effectiveness on Performances of Data Mining Algorithms. DUBİTED. 2021;9:555–567.
MLA	Timuçin, Tunahan and İrem Duzdar Argun. “Initial Seed Value Effectiveness on Performances of Data Mining Algorithms”. Duzce University Journal of Science and Technology, vol. 9, no. 2, 2021, pp. 555-67, doi:10.29130/dubited.813101.
Vancouver	Timuçin T, Duzdar Argun İ. Initial Seed Value Effectiveness on Performances of Data Mining Algorithms. DUBİTED. 2021;9(2):555-67.

Cited By

Topluluk Öğrenme ile Google Uygulamalarının İçerik Derecelendirmelerini Analiz Etme (Analyzing Content Ratings of Google Applications with Ensemble Learning)

El-Cezeri Fen ve Mühendislik Dergisi

https://doi.org/10.31202/ecjse.1059822
