Research Article

A New Method for Determining the Number of Clusters Without Clustering

Year 2025, Volume: 30 Issue: 2, 596 - 607, 31.08.2025
https://doi.org/10.53433/yyufbed.1612608

Abstract

Clustering methods are essential for identifying patterns in data, and the chosen number of clusters strongly affects the quality of the results. Determining the optimal number of clusters is challenging, particularly for large datasets, because traditional methods can be computationally expensive. Efficient techniques for determining the number of clusters are therefore crucial for improving both the accuracy and the scalability of clustering, especially in large-scale applications. This study presents a new approach for determining the number of clusters. The proposed method aims to find the number of clusters based solely on the distances between data points, without performing clustering. Similar to the Elbow method, an elbow point is located in the distances between data points, and the number of clusters is determined from this elbow point. The proposed algorithm was compared with the Elbow method on 11 real-world datasets using 4 performance metrics. The results show that the proposed method is advantageous in terms of time complexity, particularly as dataset size increases.
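
To make the idea of an elbow point on inter-point distances concrete, the following is a minimal sketch and not the algorithm from the article. It assumes Euclidean distance, sorts all pairwise distances, and locates the elbow as the point farthest from the straight line joining the first and last values of the sorted curve, which is one common elbow heuristic. The function names pairwise_distances and elbow_index are hypothetical. The abstract does not state how the elbow point is mapped to a number of clusters, so the sketch stops at locating the elbow.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Return the Euclidean distances between all distinct pairs, sorted ascending."""
    return sorted(math.dist(p, q) for p, q in combinations(points, 2))


def elbow_index(values):
    """Return the index of the value farthest from the chord joining both ends.

    The chord-distance rule is an assumed heuristic, not necessarily the
    rule used in the article.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values to locate an elbow")
    # The chord runs from (0, values[0]) to (n - 1, values[-1]).
    dx = n - 1
    dy = values[-1] - values[0]
    norm = math.hypot(dx, dy)  # > 0 because dx >= 2
    best_i, best_d = 0, -1.0
    for i, y in enumerate(values):
        # Perpendicular distance from (i, y) to the chord.
        d = abs(dy * i - dx * (y - values[0])) / norm
        if d > best_d:
            best_i, best_d = i, d
    return best_i


# Toy data (made up for illustration): two tight groups far apart.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
dists = pairwise_distances(points)
elbow = elbow_index(dists)
print(f"elbow at sorted position {elbow}, distance {dists[elbow]:.3f}")
```

On this toy input the sorted curve jumps from within-group to between-group distances, and the heuristic picks the position of that jump. Note that enumerating all pairs costs O(n^2) in the number of points; whether and how the article avoids or reduces this cost is not stated in the abstract.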


Details

Primary Language English
Subjects Algorithms and Calculation Theory, Data Structures and Algorithms
Journal Section Engineering and Architecture / Mühendislik ve Mimarlık
Authors

Duygu Selin Turan 0000-0001-9881-6013

Publication Date August 31, 2025
Submission Date January 3, 2025
Acceptance Date May 13, 2025
Published in Issue Year 2025 Volume: 30 Issue: 2

Cite

APA Turan, D. S. (2025). A New Method for Determining the Number of Clusters Without Clustering. Yüzüncü Yıl Üniversitesi Fen Bilimleri Enstitüsü Dergisi, 30(2), 596-607. https://doi.org/10.53433/yyufbed.1612608