A New Method for Determining the Number of Clusters Without Clustering

Duygu Selin Turan

doi:10.53433/yyufbed.1612608

Research Article

Year 2025, Volume: 30 Issue: 2, 596 - 607, 31.08.2025

Abstract

Clustering methods are essential for identifying patterns in data, and the chosen number of clusters strongly affects the quality of the results. Determining the optimal number of clusters is challenging, particularly for large datasets, because traditional methods can be computationally expensive. Efficient techniques for determining the number of clusters are therefore crucial for improving both the accuracy and the scalability of clustering in large-scale applications. This study presents a new approach that estimates the number of clusters solely from the distances between data points, without performing any clustering. Analogous to the Elbow method, an elbow point is located in the distances between data points, and the number of clusters is determined from this elbow point. The proposed algorithm was compared with the Elbow method on 11 real-world datasets using 4 performance metrics. The results show that the proposed method is advantageous in terms of time complexity, particularly as the dataset size increases.
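
The abstract does not state how the elbow point is located or how it is converted into a cluster count, so the following is only a minimal sketch of the elbow-location step, not the author's algorithm. It assumes Euclidean distances and a common heuristic (the point farthest from the chord joining the ends of the sorted distance curve). The function names `sorted_pairwise_distances` and `elbow_index` are hypothetical.

```python
import math
from itertools import combinations


def sorted_pairwise_distances(points):
    """Return all pairwise Euclidean distances, sorted in ascending order."""
    return sorted(math.dist(p, q) for p, q in combinations(points, 2))


def elbow_index(values):
    """Return the index of the value farthest from the chord joining the
    first and last points of the curve (i, values[i]).

    Assumption: this chord-distance rule stands in for whatever elbow
    criterion the paper actually uses. Both axes are rescaled to [0, 1]
    so the result does not depend on the units of the distances.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values to locate an elbow")
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against a completely flat curve
    xs = [i / (n - 1) for i in range(n)]
    ys = [(v - lo) / span for v in values]
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    norm = math.hypot(dx, dy)  # dx is always 1, so norm is never zero
    # Perpendicular distance of each point from the chord.
    gaps = [abs(dy * (x - xs[0]) - dx * (y - ys[0])) / norm
            for x, y in zip(xs, ys)]
    return max(range(n), key=gaps.__getitem__)


if __name__ == "__main__":
    # Two well-separated toy groups of three points each (illustrative data).
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    distances = sorted_pairwise_distances(points)
    print(elbow_index(distances), distances)
```

Computing every pairwise distance costs O(n²) time and memory in this naive form; how the paper handles this, and how it maps the elbow to a number of clusters, is described in the full text rather than in the abstract.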

Keywords

Clustering, Comparative analysis, Cluster validity index

References

Akogul, S., & Erisoglu, M. (2017). An approach for determining the number of clusters in a model-based cluster analysis. Entropy, 19(9), 452. https://doi.org/10.3390/e19090452
Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, in New Orleans, Louisiana.
Calinski, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1), 1-27. https://doi.org/10.1080/03610927408827101
Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2), 224-227. https://doi.org/10.1109/TPAMI.1979.4766909
Doan, H., & Nguyen, D. (2018). A method for finding the appropriate number of clusters. The International Arab Journal of Information Technology, 15(4), 675-682.
Dua, D., & Graff, C. (2019). UCI machine learning repository. Access date: 10.12.2024 https://archive.ics.uci.edu/datasets
Halkidi, M., & Vazirgiannis, M. (2001, November). Clustering validity assessment: Finding the optimal partitioning of a data set. In Proceedings 2001 IEEE international conference on data mining (pp. 187-194). IEEE. https://doi.org/10.1109/ICDM.2001.989517
Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1), 100–108. https://doi.org/10.2307/2346830
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666. https://doi.org/10.1016/j.patrec.2009.09.011
Jung, Y., Park, H., Du, D. Z., & Drake, B. L. (2003). A decision criterion for the optimal number of clusters in hierarchical clustering. Journal of Global Optimization, 25(1), 91-111. https://doi.org/10.1023/A:1021394316112
Karna, A., & Gibert, K. (2022). Automatic identification of the number of clusters in hierarchical clustering. Neural Computing and Applications, 34(1), 119-134. https://doi.org/10.1007/s00521-021-05873-3
Kaufman, L., & Rousseeuw, P. J. (2009). Finding groups in data: An introduction to cluster analysis. Wiley-Interscience.
Kothari, R., & Pitts, D. (1999). On finding the number of clusters. Pattern Recognition Letters, 20(4), 405-416. https://doi.org/10.1016/S0167-8655(99)00008-2
Li, H., Zhang, S., Ding, X., Zhang, C., & Dale, P. (2016). Performance evaluation of cluster validity indices (CVIs) on multi-/hyperspectral remote sensing datasets. Remote Sensing, 8(4), 295. https://doi.org/10.3390/rs8040295
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137. https://doi.org/10.1109/TIT.1982.1056489
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. University of California Press.
Pakhira, M. K. (2012). Finding number of clusters before finding clusters. Procedia Technology, 4, 27-37. https://doi.org/10.1016/j.protcy.2012.05.004
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65. https://doi.org/10.1016/0377-0427(87)90125-7
Thorndike, R. L. (1953). Who belongs in the family? Psychometrika, 18(4), 267-276. https://doi.org/10.1007/BF02289263
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411–423. https://doi.org/10.1111/1467-9868.00293
Vinutha, H. P., Poornima, B., & Sagar, B. M. (2018). Detection of outliers using interquartile range technique from intrusion dataset. In Information and decision sciences: Proceedings of the 6th international conference on FICTA (pp. 511-518). Springer Singapore. https://doi.org/10.1007/978-981-10-7563-6_53
Wiroonsri, N. (2024). Clustering performance analysis using a new correlation-based cluster validity index. Pattern Recognition, 145, 109910. https://doi.org/10.1016/j.patcog.2023.109910

There are 22 citations in total.

Details

Primary Language	English
Subjects	Algorithms and Calculation Theory, Data Structures and Algorithms
Journal Section	Engineering and Architecture / Mühendislik ve Mimarlık
Authors	Duygu Selin Turan (ORCID: 0000-0001-9881-6013)
Publication Date	August 31, 2025
Submission Date	January 3, 2025
Acceptance Date	May 13, 2025
Published in Issue	Year 2025 Volume: 30 Issue: 2

Cite

APA	Turan, D. S. (2025). A New Method for Determining the Number of Clusters Without Clustering. Yüzüncü Yıl Üniversitesi Fen Bilimleri Enstitüsü Dergisi, 30(2), 596-607. https://doi.org/10.53433/yyufbed.1612608
