Research Article

The Effect of Feature Selection Methods to Classification Performance in Health Datasets

Year 2021, Volume: 1 Issue: 1, 6 - 11, 15.04.2021

Abstract

Introduction: Because data collected from different devices have made data sets very high-dimensional and specific, attribute selection has become an important preprocessing task for reducing data size in data mining. This study aims to improve classification performance while reducing computation time and cost by using attribute selection methods. Materials and Methods: Attribute selection methods are examined under three main headings: filter methods, wrapper methods and embedded methods. Among machine learning classification algorithms, support vector machine, Naïve Bayes and decision tree (J48) methods were used in the study. Data sets were obtained from the UCI and Kaggle databases. Accuracy, sensitivity, specificity, precision and F-measure values were calculated to compare the classification performance of the algorithms. All analyses were performed with WEKA version 3.8.3, R 3.3.0 and Tableau. After unnecessary features were removed using appropriate methods, the classification performance and run time of each algorithm were calculated. Results: After attribute selection, accuracy values increased to 87% for Colorectal Histology MNIST, 85% for Parkinson’s disease, 97% for SCADI, 100% for HCC, and 78% for breast cancer. The highest performance was obtained with the wrapper method combined with decision trees (J48). The filter method was the fastest, and the wrapper method took the longest to run. The performance improvement after attribute selection was greater for data sets with a large number of attributes. Conclusion: Low-dimensional data sets may provide higher classification accuracy at lower computational cost.
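The five performance measures named in the abstract can all be derived from the counts of a binary confusion matrix. The following is a minimal sketch in Python, not the WEKA or R code used in the study; the function name `binary_metrics`, the `BinaryMetrics` container and the example labels are hypothetical and serve only to illustrate the standard definitions (sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), precision = TP/(TP+FP), F-measure = harmonic mean of precision and sensitivity).

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    precision: float
    f_measure: float


def binary_metrics(y_true, y_pred, positive=1):
    """Compute the five measures from true and predicted binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return BinaryMetrics(
        accuracy=ratio(tp + tn, tp + tn + fp + fn),
        sensitivity=sensitivity,
        specificity=ratio(tn, tn + fp),
        precision=precision,
        f_measure=ratio(2 * precision * sensitivity, precision + sensitivity),
    )


# Made-up labels for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
m = binary_metrics(y_true, y_pred)
print(m)
```

With these made-up labels there are 3 true positives, 4 true negatives, 2 false positives and 1 false negative, so accuracy is 0.7 and sensitivity is 0.75. For data sets with more than two classes, such measures are usually computed per class and then averaged, which this sketch does not cover.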

Sağlık Veri Setlerinde Öznitelik Seçiminin Sınıflandırma Performansına Etkisi

Abstract

Introduction: Because data collected from different devices have made data sets very high-dimensional and specific, feature selection is an important data preprocessing step for reducing data size in data mining. This study aims to improve classification performance while reducing the computation time and cost of machine learning methods by using feature selection methods. Materials and Methods: Feature selection methods are examined under three main headings: filter methods, wrapper methods and embedded methods. Among machine learning classification algorithms, support vector machine, Naïve Bayes and decision tree methods were used in the study. The data used in the study were obtained from the UCI and Kaggle databases. Accuracy, sensitivity, specificity, precision and F-measure values were calculated to compare the classification performance of the algorithms. WEKA 3.8.3, R 3.3.0 and Tableau were used in all analyses. After unnecessary features were removed using appropriate methods, the classification performance and run times of the algorithms were calculated. Results: After feature selection, accuracy values rose to 87% for MNIST, 85% for Parkinson, 97% for SCADI, 100% for HCC and 78% for breast cancer. The highest performance was obtained with decision trees (J48) combined with wrapper-method feature selection. The filter method was the fastest, while the wrapper method ran the longest. According to the findings, the classification performance of data with a large number of features was lower than that of data to which feature selection had been applied. Conclusion: Low-dimensional data sets can provide higher classification accuracy at lower computational cost.
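The difference between filter and wrapper methods, and why the wrapper is slower, can be illustrated with a small sketch. This is an illustration under stated assumptions, not the procedure used in the study: the filter criterion (absolute Pearson correlation with the class label), the nearest-centroid classifier that stands in for SVM, Naïve Bayes and J48, the function names and the toy data are all hypothetical choices. Embedded methods, which select features during model training, are not sketched here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def filter_select(X, y, k):
    """Filter: rank features by |correlation with the label|; no classifier is trained."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on a feature subset."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[r] for r in range(len(X)) if r != i and y[r] == label]
            if rows:
                centroids[label] = [sum(row[j] for row in rows) / len(rows) for j in features]
        point = [X[i][j] for j in features]
        predicted = min(centroids, key=lambda c: math.dist(point, centroids[c]))
        correct += predicted == y[i]
    return correct / len(X)


def wrapper_select(X, y):
    """Wrapper: greedy forward selection; every candidate subset is scored by the classifier."""
    selected, best = [], 0.0
    remaining = set(range(len(X[0])))
    while remaining:
        scored = [(loo_accuracy(X, y, selected + [j]), j) for j in sorted(remaining)]
        score, j = max(scored, key=lambda s: s[0])  # lowest index wins ties
        if score <= best:
            break  # no candidate improves accuracy, so stop
        selected.append(j)
        remaining.remove(j)
        best = score
    return sorted(selected), best


# Made-up toy data: feature 0 separates the classes, feature 1 is noise, feature 2 is weak.
X = [
    [1.0, 5.0, 2.0],
    [1.2, 1.0, 2.5],
    [0.8, 3.0, 3.0],
    [3.0, 3.0, 2.5],
    [3.2, 5.0, 3.0],
    [2.8, 1.0, 3.5],
]
y = [0, 0, 0, 1, 1, 1]

print(filter_select(X, y, 2))
print(wrapper_select(X, y))
```

The filter scores each feature once, whereas the wrapper retrains and re-evaluates the classifier for every candidate subset at every step. That repeated evaluation is the usual reason a wrapper runs longer, which is consistent with the run-time ordering reported in the abstract.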

Details

Primary Language Turkish
Subjects Artificial Intelligence (Other)
Journal Section Research Article
Authors

Mert Demirarslan 0000-0001-8848-7340

Aslı Suner 0000-0002-6872-9901

Publication Date April 15, 2021
Published in Issue Year 2021 Volume: 1 Issue: 1

Cite

Vancouver Demirarslan M, Suner A. Sağlık Veri Setlerinde Öznitelik Seçiminin Sınıflandırma Performansına Etkisi. JAIHS. 2021;1(1):6-11.