Research Article

Karma Veriler İçin Kümeleme Yöntemlerinin Karşılaştırılması: Hipotetik Öğrenci Burs Verisi Üzerine Durum Çalışması

Year 2025, Issue: 59, 1 - 14
https://doi.org/10.33418/education.1674501

Abstract

Clustering is a widely used technique for uncovering patterns in complex datasets and grouping individuals. It is applied frequently in education in particular, where academic and contextual variables both matter. This study aims to introduce the fundamentals of six clustering methods and to examine their performance in classifying students into scholarship eligibility groups, using a hypothetical student scholarship dataset generated in R. The dataset consists of two numerical variables (GPA and Scholarship Exam Result) and four categorical variables (Financial Need Status, Number of Working Parents, Student's Employment Status, and Accommodation Type), reflecting typical scholarship evaluation criteria. Students were labeled as "Primary Candidates", "Secondary Candidates", and "Rejected Candidates", and the clustering methods (K-Means, K-Modes, K-Prototypes, Partitioning Around Medoids, Latent Class Analysis, and K-Means following Factor Analysis of Mixed Data) were evaluated on how accurately they reproduced these labels. The findings show that hybrid approaches, particularly K-Prototypes (95.6%) and Partitioning Around Medoids (92.5%), achieved the highest accuracy. K-Means following Factor Analysis of Mixed Data (93.9%) offered a strong alternative through dimensionality reduction, while Latent Class Analysis reached 85.9% accuracy. The findings demonstrate the value of categorical variables in clustering applications and underline the importance of selecting suitable clustering techniques for mixed-type educational data, especially in high-stakes contexts such as scholarship selection.


Comparison of Clustering Methods for Mixed Data: A Case Study on Hypothetical Student Scholarship Data


Abstract

Clustering is a widely used technique for uncovering patterns and grouping individuals within complex datasets, particularly in fields like education where both academic and contextual variables are essential. This study introduces the basics of six clustering methods and explores their performance in classifying students into scholarship eligibility groups, using a hypothetical student scholarship dataset generated in R. The dataset consists of two numerical variables (GPA and Scholarship Exam Result) and four categorical variables (Financial Need, Number of Parents Employed, Employment Status, and Accommodation), reflecting typical criteria in educational funding decisions. Students were labeled as Primary, Secondary, or Rejected Candidates, and the clustering methods (K-Means, K-Modes, K-Prototypes, Partitioning Around Medoids (PAM), Latent Class Analysis (LCA), and Factor Analysis for Mixed Data (FAMD) followed by K-Means) were assessed on how accurately they reproduced these labels. Results indicate that hybrid approaches, particularly K-Prototypes (95.6%) and PAM (92.5%), achieved the highest accuracy. FAMD + K-Means (93.9%) offered a robust alternative through dimensionality reduction, while LCA reached 85.9% accuracy. The findings highlight the value of categorical variables in clustering applications and demonstrate the importance of selecting suitable clustering techniques for mixed-type educational data, especially in high-stakes contexts such as scholarship selection.
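The abstract reports accuracy as the degree to which each method reproduced the known candidate labels, but it does not state how cluster numbers were matched to labels. One common way to do this, shown here as a minimal sketch and not as the authors' procedure, is to try every one-to-one assignment of clusters to labels and keep the best agreement. The function name `clustering_accuracy` and the toy data are hypothetical and are not taken from the study.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Share of cases that agree under the best cluster-to-label mapping.

    Cluster ids are arbitrary, so every one-to-one assignment of clusters
    to labels is tried and the highest-scoring assignment is kept.
    """
    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(labels):
        raise ValueError("more clusters than labels")
    best_hits = 0
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


# Hypothetical toy data for illustration only (not the study's dataset).
truth = ["Primary", "Primary", "Secondary", "Secondary", "Rejected", "Rejected"]
found = [2, 2, 0, 1, 1, 1]
print(clustering_accuracy(truth, found))
```

With three candidate groups there are only six possible assignments, so exhaustive search is cheap; for many clusters an assignment solver would be the usual substitute.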

References

  • Ahmad, A., & Khan, S. S. (2019). Survey of state-of-the-art mixed data clustering algorithms. IEEE Access, 7, 31883-31902. https://doi.org/10.1109/ACCESS.2019.2903568
  • Bektas, A., & Schumann, R. (2019, June). How to optimize Gower distance weights for the k-medoids clustering algorithm to obtain mobility profiles of the Swiss population. In 2019 6th Swiss Conference on Data Science (SDS) (pp. 51-56). IEEE. https://doi.org/10.1109/SDS.2019.000-8
  • Costa, E., Papatsouma, I., & Markos, A. (2023). Benchmarking distance-based partitioning methods for mixed-type data. Advances in Data Analysis and Classification, 17(3), 701-724. https://doi.org/10.1007/s11634-022-00521-7
  • Dutt, A., Ismail, M. A., Herawan, T., & Targio, I. A. (2024). Partition-based clustering algorithms applied to mixed data for educational data mining: A survey from 1971 to 2024. IEEE Access, 12, 172923-172942. https://doi.org/10.1109/ACCESS.2024.3496929
  • Hadzi-Pavlovic, D. (2010). Finding patterns and groupings: II. Introduction to latent profile analysis and finite mixture models. Acta Neuropsychiatrica, 22(1), 40-42. https://doi.org/10.1111/j.1601-5215.2009.00442.x
  • Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2019). Multivariate data analysis (8th ed.). Cengage Learning.
  • Hunt, L., & Jorgensen, M. (2011). Clustering mixed data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(4), 352-361. https://doi.org/10.1002/widm.33
  • Kim, B. (2017). A fast K-prototypes algorithm using partial distance computation. Symmetry, 9(4), 58-68. https://doi.org/10.3390/sym9040058
  • MacQueen, J. (1967, January). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics (Vol. 5, pp. 281-298). University of California Press.
  • Romero, C., & Ventura, S. (2020). Educational data mining and learning analytics: An updated survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(3), 1-21. https://doi.org/10.1002/widm.1355
  • Saracco, J., & Chavent, M. (2016). Clustering of variables for mixed data. EAS Publications Series, 77, 121-169. https://doi.org/10.1051/eas/1677007
  • Sayadi, S., Geffard, E., Südholt, M., Vince, N., & Gourraud, P. A. (2021). Secure distribution of factor analysis of mixed data (FAMD) and its application to personalized medicine of transplanted patients. In L. Barolli, I. Woungang, & T. Enokido (Eds.), Advanced information networking and applications (AINA 2021) (Lecture Notes in Networks and Systems, Vol. 225, pp. 524-534). Springer. https://doi.org/10.1007/978-3-030-75100-5_44
  • Schubert, E., & Rousseeuw, P. J. (2019). Faster K-medoids clustering: Improving the PAM, CLARA, and CLARANS algorithms. In Similarity Search and Applications: 12th International Conference, SISAP 2019, Newark, NJ, USA, October 2-4, 2019, Proceedings 12 (pp. 171-187). Springer International Publishing.
  • Sujatha, K. (2012). Implementation of K-Modes algorithm to cluster very large categorical data sets in data mining. Data Mining and Knowledge Engineering, 4, 481-486.
  • Van de Velden, M., Iodice D'Enza, A., & Markos, A. (2018). Distance-based clustering of mixed data. Wiley Interdisciplinary Reviews: Computational Statistics, 11(3). https://doi.org/10.1002/wics.1456
  • Vermunt, J. K., & Magidson, J. (2002). Latent class cluster analysis. In J. A. Hagenaars & A. L. McCutcheon (Eds.), Applied latent class analysis (pp. 89-106). Cambridge University Press.
  • Wang, Y., Yang, L., Wu, J., & Shi, L. (2022, July). Mining multi-source campus data: An empirical analysis of student portrait using clustering method. In 2022 5th International Conference on Data Science and Information Technology (DSIT) (pp. 1-6). IEEE.
  • Wu, J., Xiong, H., & Chen, J. (2009, June). Adapting the right measures for K-means clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 877-886).
  • Zhang, P., Wang, X., & Song, P. X. K. (2006). Clustering categorical data based on distance vectors. Journal of the American Statistical Association, 101(473), 355-367.
There are 19 citations in total.

Details

Primary Language English
Subjects Measurement and Evaluation in Education (Other)
Journal Section Research Articles
Authors

Hüseyin Ataseven 0000-0001-9992-4518

Ömay Çokluk Bökeoğlu 0000-0002-3879-9204

Fazilet Taşdemir 0000-0002-0430-9094

Early Pub Date October 1, 2025
Publication Date October 13, 2025
Submission Date April 12, 2025
Acceptance Date May 26, 2025
Published in Issue Year 2025 Issue: 59

Cite

APA Ataseven, H., Çokluk Bökeoğlu, Ö., & Taşdemir, F. (2025). Comparison of Clustering Methods for Mixed Data: A Case Study on Hypothetical Student Scholarship Data. Educational Academic Research, (59), 1-14. https://doi.org/10.33418/education.1674501

Content of this journal is licensed under a Creative Commons Attribution NonCommercial 4.0 International License