An Investigation of Data Mining Classification Methods in Classifying Students According to 2018 PISA Reading Scores

Emrah Büyükatak; Duygu Anıl

doi:10.21449/ijate.1208809

Research Article

Year 2022, Volume: 9 Issue: 4, 867 - 882, 22.12.2022

Abstract

This study used PISA 2018 data to determine how accurately four data mining classification methods (Artificial Neural Networks, Decision Trees, K-Nearest Neighbor, and Naive Bayes) classify students by reading achievement on the basis of the factors thought to affect it, and to examine the general characteristics of the achievement groups. The data comprised 6890 PISA 2018 student questionnaires. First, missing data were examined and completed. Second, 24 index variables thought to affect students' reading achievement were selected by reviewing the related literature, the PISA 2018 Technical Report, and the PISA 2018 data. Third, in view of the sub-classification problem, students were assigned to two categories, “Successful” and “Unsuccessful”, according to their scores on the PISA 2018 reading achievement test. The analyses were conducted with the SPSS Modeler program. The Decision Trees C5.0 algorithm had the highest classification rate (89.6%) and the QUEST algorithm the lowest (75%). The Two-Step Clustering analysis, used to examine general characteristics according to achievement scores, yielded four clusters of roughly equal proportions. Because the Silhouette Coefficient calculated in the clustering analyses (0.1) is greater than 0, it can be said that the data sets are suitable for clustering. Because every model classified students more accurately than chance, it can be concluded that all of the data mining methods examined can be used to classify students according to achievement scores.
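To make the Silhouette Coefficient mentioned in the abstract more concrete, the following is a minimal sketch of the standard definition, in which each point receives the value s = (b - a) / max(a, b): a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The overall coefficient is the mean of these values and lies between -1 and 1. This is not the authors' SPSS Modeler procedure. The function name `silhouette_coefficient`, the Euclidean distance, and the toy points are assumptions chosen purely for illustration.

```python
from math import dist
from statistics import mean


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points, using Euclidean distance."""
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Common convention: a singleton cluster contributes 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the distance of the point to itself is 0, so it adds nothing).
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(point, other) for other in members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        largest = max(a, b)
        scores.append((b - a) / largest if largest > 0 else 0.0)
    return mean(scores)


# Hypothetical toy data: two tight, well-separated groups of three points.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
well_separated = silhouette_coefficient(toy_points, [0, 0, 0, 1, 1, 1])
# The same points with labels that cut across the two groups.
mixed_up = silhouette_coefficient(toy_points, [0, 1, 0, 1, 0, 1])
print(well_separated, mixed_up)
```

On the toy data, labels that follow the two groups give a coefficient close to 1, while labels that cut across them give a much lower one. A value slightly above 0, such as the 0.1 reported in the abstract, falls between these two cases.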

Keywords

Data Mining, Artificial Neural Networks, Decision Trees, Cluster Analysis, Classification



Details

Primary Language	English
Subjects	Other Fields of Education
Journal Section	Articles
Authors	Emrah Büyükatak 0000-0002-5341-5053; Duygu Anıl 0000-0002-1745-4071
Publication Date	December 22, 2022
Submission Date	June 10, 2022
Published in Issue	Year 2022 Volume: 9 Issue: 4

Cite

APA	Büyükatak, E., & Anıl, D. (2022). An Investigation of Data Mining Classification Methods in Classifying Students According to 2018 PISA Reading Scores. International Journal of Assessment Tools in Education, 9(4), 867-882. https://doi.org/10.21449/ijate.1208809