Research Article

An Investigation of Data Mining Classification Methods in Classifying Students According to 2018 PISA Reading Scores

Year 2022, Volume: 9 Issue: 4, 867 - 882, 22.12.2022
https://doi.org/10.21449/ijate.1208809

Abstract

This study aimed to determine how accurately the factors affecting students' reading achievement in PISA 2018 classify students, using four data mining classification methods (Artificial Neural Networks, Decision Trees, K-Nearest Neighbor, and Naive Bayes), and to examine the general characteristics of the achievement groups. The data comprised 6890 student questionnaires from PISA 2018. First, missing data were examined and completed. Second, 24 index variables thought to affect reading achievement were identified by examining the related literature, the PISA 2018 Technical Report, and the PISA 2018 data. Third, in view of the sub-classification problem, students were placed in two categories, “Successful” and “Unsuccessful”, according to their scores on the PISA 2018 reading achievement test. The analyses were conducted with the SPSS Modeler program. The Decision Trees C5.0 algorithm had the highest classification rate (89.6%), and the QUEST algorithm had the lowest (75%). The Two-Step Clustering analysis, used to examine general characteristics according to achievement scores, yielded four clusters of proportionally similar size. Because the Silhouette Coefficient calculated in the clustering analyses (0.1) is greater than 0, the data set can be said to be suitable for clustering. Since all of the models classified students more accurately than chance, it can be concluded that all of the data mining methods examined can be used to classify students according to achievement scores.
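As an illustration of the Silhouette Coefficient mentioned in the abstract, the following minimal sketch computes the standard silhouette value, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from point i to the other members of its own cluster and b(i) is the smallest mean distance from point i to the members of any other cluster. Values range from -1 to 1, and values above 0 indicate that points are, on average, closer to their own cluster than to the nearest other cluster. This is an assumption-laden sketch: the function names and the toy points are hypothetical, the data are not PISA data, and the silhouette measure reported by SPSS Modeler may be computed differently from this textbook definition.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def silhouette_scores(points, labels):
    """Return the textbook silhouette value s(i) for every point.

    Assumes at least two distinct cluster labels are present.
    """
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def mean_silhouette(points, labels):
    """Average silhouette over all points (the overall coefficient)."""
    s = silhouette_scores(points, labels)
    return sum(s) / len(s)


# Hypothetical toy data: two well-separated pairs of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good_labels = [0, 0, 1, 1]  # labels that follow the visible grouping
poor_labels = [0, 1, 0, 1]  # labels that cut across the grouping

if __name__ == "__main__":
    print(round(mean_silhouette(toy_points, good_labels), 3))
    print(round(mean_silhouette(toy_points, poor_labels), 3))
```

In this toy case the labels that follow the visible grouping give a clearly positive coefficient, while the labels that cut across it give a negative one, which is the sense in which a coefficient greater than 0 is read as support for the cluster solution.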


Details

Primary Language English
Subjects Other Fields of Education
Journal Section Articles
Authors

Emrah Büyükatak 0000-0002-5341-5053

Duygu Anıl 0000-0002-1745-4071

Publication Date December 22, 2022
Submission Date June 10, 2022
Published in Issue Year 2022 Volume: 9 Issue: 4

Cite

APA Büyükatak, E., & Anıl, D. (2022). An Investigation of Data Mining Classification Methods in Classifying Students According to 2018 PISA Reading Scores. International Journal of Assessment Tools in Education, 9(4), 867-882. https://doi.org/10.21449/ijate.1208809
