Educational Data Mining: Construction of a Tree-based Model to Predict Students’ Performance

Furkan Aydın

doi:10.14686/buefad.1390209

Research Article

Year 2025, Volume 14, Issue 1, pp. 181–195, 29.01.2025

Abstract

Educational data mining is a research field that uncovers hidden patterns in educational data. In this paper, machine learning algorithms were applied to a dataset of key features in order to predict students’ final grade performance, and to identify both the most significant features and the best-performing machine learning algorithm. Univariate feature selection, tree-based feature selection, and L1-based feature selection were used for the feature selection process. Classification and regression trees, k-nearest neighbors, naive Bayes, random forest, and support vector machines were employed to build the learning models. L1-based feature selection and classification and regression trees delivered the best performance for feature selection and model creation, respectively. The experimental results show that the proposed model reached, on average, a classification accuracy of 0.7700 and an F1-score of 0.7888. The L1-based feature selection method selected only four features: scholarship type, total salary, transportation to the university, and cumulative grade point average in the last semester. Although many indicators affect students’ academic success, the success or failure that emerges after the measurement process can be estimated in advance from these features. Such estimation will make the relationship between educational inputs and outputs easier to understand and help eliminate shortcomings in the education process.
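The abstract names classification and regression trees as the best-performing model. As a rough illustration of how such a tree can choose a split, the sketch below searches one numeric feature for the threshold that minimises the weighted Gini impurity. The GPA values, pass/fail labels, and function names are hypothetical and are not taken from the paper's dataset; the abstract does not state which split criterion or implementation the author used.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the split `value <= threshold`
    that gives the lowest weighted Gini impurity.
    Returns (None, parent_gini) when no split improves on the parent node."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # identical values cannot be separated by a threshold
        threshold = (low + high) / 2  # midpoint between neighbouring values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: last-semester GPA and outcome (1 = pass, 0 = fail).
gpa = [1.8, 2.1, 2.4, 2.9, 3.2, 3.6]
passed = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(gpa, passed)
print(threshold, impurity)
```

On this toy data the chosen threshold falls between 2.4 and 2.9, where both child nodes become pure. A full tree would repeat the same search over every feature and recurse into each child node.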

Keywords

Academic performance, academic achievement, artificial intelligence, educational data mining, feature selection

Educational Data Mining: Construction of a Tree-Based Model to Predict Students’ Performance

Abstract

Educational data mining is a research field that discovers hidden patterns in educational data. In this study, machine learning algorithms were applied to a dataset consisting of the most basic features in order to predict students’ final grade performance. In this way, the most important features and the highest-performing machine learning algorithm were also sought. To this end, univariate feature selection, tree-based feature selection, and L1-based feature selection methods were used in the feature selection process. Classification and regression trees, k-nearest neighbors, naive Bayes, random forest, and support vector machines were used to build the learning models. L1-based feature selection and classification and regression trees provided the best performance in the feature selection and model building processes, respectively. The experimental results show that the proposed model reached an average classification accuracy of 0.7700 and an F1 score of 0.7888. The L1-based feature selection method selected only 4 features: scholarship type, total salary, transportation to the university, and cumulative grade point average in the last semester. In conclusion, many indicators affect students’ academic success, and the success or failure that emerges after the measurement process can be predicted in advance by taking these features into account. Such a task will make the relationship mechanism between educational inputs and outputs understandable and eliminate shortcomings in the education process.
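The two reported scores, classification accuracy and F1, can be computed from predictions as in the minimal sketch below. The labels are made-up toy values, and the binary pass/fail setting is an assumption: the abstract does not state the number of grade classes or how the F1 score was averaged.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels (1 = pass, 0 = fail); not data from the paper.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(round(acc, 4), round(f1, 4))
```

On this toy data the accuracy is 0.8 and the F1 score is about 0.83, which illustrates that F1 can exceed accuracy, as it does in the reported results.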

Keywords

Academic performance, academic achievement, artificial intelligence, educational data mining, feature selection

There are 39 citations in total.

Details

Primary Language	English
Subjects	Higher Education Studies (Other)
Journal Section	Research Article
Authors	Furkan Aydın (ORCID: 0000-0003-0610-8744)
Publication Date	January 29, 2025
Submission Date	November 13, 2023
Acceptance Date	October 13, 2024
Published in Issue	Year 2025 Volume: 14 Issue: 1

Cite

APA	Aydın, F. (2025). Educational Data Mining: Construction of a Tree-based Model to Predict Students’ Performance. Bartın University Journal of Faculty of Education, 14(1), 181-195. https://doi.org/10.14686/buefad.1390209

All articles published in the journal are open access and distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License.

Bartın University Journal of Faculty of Education