Comparison of Clustering Methods for Mixed Data: A Case Study on Hypothetical Student Scholarship Data

Hüseyin Ataseven; Ömay Çokluk Bökeoğlu; Fazilet Taşdemir

doi:10.33418/education.1674501

Research Article

Year: 2025, Issue: 59, Pages: 1-14

Abstract

Clustering is widely used to uncover patterns and group individuals in complex datasets, particularly in fields such as education, where both academic and contextual variables matter. This study introduces the basics of six clustering methods and examines how well they classify students into scholarship eligibility groups, using a hypothetical student scholarship dataset generated in R. The dataset consists of two numerical variables (GPA and Scholarship Exam Result) and four categorical variables (Financial Need, Number of Parents Employed, Student Employment Status, and Accommodation Type), reflecting typical criteria in scholarship decisions. Students were labeled as Primary, Secondary, or Rejected Candidates, and six clustering methods were assessed on how accurately they reproduced these labels: K-Means, K-Modes, K-Prototypes, Partitioning Around Medoids (PAM), Latent Class Analysis (LCA), and Factor Analysis for Mixed Data (FAMD) followed by K-Means. Results indicate that hybrid approaches, particularly K-Prototypes (95.6%) and PAM (92.5%), achieved the highest accuracy. FAMD followed by K-Means (93.9%) offered a robust alternative through dimensionality reduction, while LCA reached 85.9% accuracy. The findings highlight the value of categorical variables in clustering applications and demonstrate the importance of selecting suitable clustering techniques for mixed-type educational data, especially in high-stakes contexts such as scholarship selection.
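
As an illustration of the two ideas the abstract relies on, the following Python sketch shows a K-Prototypes-style dissimilarity for one mixed record pair and a label-agreement accuracy for a clustering result. It is a minimal sketch under stated assumptions, not the authors' R code: the record layout, the category values, the weight `gamma`, and the choice to score accuracy under the best one-to-one mapping of clusters to labels are all hypothetical choices made here for illustration.

```python
from itertools import permutations

# Hypothetical record layout (an assumption, not taken from the article):
# two standardized numeric fields followed by four categorical fields.
NUMERIC = (0, 1)             # GPA, Scholarship Exam Result
CATEGORICAL = (2, 3, 4, 5)   # need, parents employed, student employed, housing


def mixed_dissimilarity(a, b, gamma=0.5):
    """K-Prototypes-style cost: squared Euclidean distance on the numeric
    fields plus gamma times the number of categorical mismatches.
    The default gamma is an arbitrary illustrative weight."""
    numeric_part = sum((a[i] - b[i]) ** 2 for i in NUMERIC)
    mismatches = sum(a[i] != b[i] for i in CATEGORICAL)
    return numeric_part + gamma * mismatches


def clustering_accuracy(true_labels, cluster_ids):
    """Share of cases that agree with the labels under the best one-to-one
    mapping of cluster ids to labels (brute force, fine for three groups)."""
    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(labels):
        raise ValueError("more clusters than labels; no one-to-one mapping")
    best_hits = 0
    for assigned in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[c] == t for c, t in zip(cluster_ids, true_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


# Two made-up students that differ in GPA and in two categorical fields.
student_a = (0.5, 1.0, "high", "0", "no", "dormitory")
student_b = (0.0, 1.0, "high", "1", "no", "family")
distance = mixed_dissimilarity(student_a, student_b)

# Made-up labels and cluster assignments for six students.
truth = ["Primary", "Primary", "Secondary", "Secondary", "Rejected", "Rejected"]
found = [2, 2, 0, 1, 1, 1]
accuracy = clustering_accuracy(truth, found)

print(distance, accuracy)
```

Cluster ids carry no inherent meaning, so some matching step of this kind is needed before a clustering can be scored against known labels; the exhaustive search over mappings is only practical because there are three groups.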

Keywords

Clustering mixed data, K-Means, K-Prototypes, Latent Class Analysis, Factor Analysis with Mixed Data

Details

Primary Language	English
Subjects	Measurement and Evaluation in Education (Other)
Journal Section	Research Articles
Authors	Hüseyin Ataseven (ORCID 0000-0001-9992-4518); Ömay Çokluk Bökeoğlu (ORCID 0000-0002-3879-9204); Fazilet Taşdemir (ORCID 0000-0002-0430-9094)
Early Pub Date	October 1, 2025
Publication Date	October 13, 2025
Submission Date	April 12, 2025
Acceptance Date	May 26, 2025
Published in Issue	Year 2025 Issue: 59

Cite

APA	Ataseven, H., Çokluk Bökeoğlu, Ö., & Taşdemir, F. (2025). Comparison of Clustering Methods for Mixed Data: A Case Study on Hypothetical Student Scholarship Data. Educational Academic Research, (59), 1-14. https://doi.org/10.33418/education.1674501

Content of this journal is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.