TY  - JOUR
T1  - A New Method for Determining the Number of Clusters Without Clustering
TT  - Kümeleme Yapmadan Küme Sayısını Belirlemek için Yeni Bir Yöntem
AU  - Turan, Duygu Selin
PY  - 2025
DA  - August
Y2  - 2025
DO  - 10.53433/yyufbed.1612608
JF  - Yüzüncü Yıl Üniversitesi Fen Bilimleri Enstitüsü Dergisi
JO  - YYU JINAS
PB  - Van Yüzüncü Yıl Üniversitesi
WT  - DergiPark
SN  - 1300-5413
SP  - 596
EP  - 607
VL  - 30
IS  - 2
LA  - en
AB  - Clustering methods are essential for identifying patterns in data, and the number of clusters significantly impacts the quality of results. Determining the optimal number of clusters is challenging, particularly for large datasets, as traditional methods can be computationally expensive. Developing efficient techniques to determine the number of clusters is crucial for improving both the accuracy and scalability of clustering, especially in large-scale applications. In this study, a new approach for determining the number of clusters is presented. The proposed method aims to find the number of clusters based solely on the distances between data points, without performing clustering. Similar to the Elbow method, the elbow point is found for the distances between data points, and the number of clusters is determined using this elbow point. The proposed algorithm was compared with the Elbow method using 11 real-world datasets and 4 performance metrics. The results demonstrate that the proposed method is particularly advantageous in terms of time complexity, especially as the dataset size increases.
KW  - Clustering
KW  - Comparative analysis
KW  - Cluster validity index
N2  - Kümeleme yöntemleri, verilerdeki örüntüleri belirlemek için önemlidir ve küme sayısı, sonuçların kalitesini önemli ölçüde etkiler. Optimal küme sayısının belirlenmesi, özellikle büyük veri setlerinde zorlu bir görevdir, çünkü geleneksel yöntemler hesaplama açısından maliyetli olabilir. Küme sayısını belirlemek için verimli tekniklerin geliştirilmesi, özellikle büyük ölçekli uygulamalarda, kümelemenin doğruluğu ve ölçeklenebilirliğini iyileştirmek için kritik öneme sahiptir. Bu çalışmada, küme sayısını belirlemek için yeni bir yaklaşım sunulmuştur. Önerilen yöntem, kümeleme yapmadan sadece veri noktaları arası uzaklıkları temel alarak küme sayısını bulmayı amaçlar. Dirsek yöntemine benzer şekilde, veri noktaları arası uzaklıklar için dirsek noktası bulunup, bu dirsek noktası yardımıyla küme sayısı belirlenir. Önerilen algoritma, Dirsek yöntemiyle 11 veri seti ve 4 performans metriği kullanılarak karşılaştırılmıştır. Sonuçlar, önerilen yöntemin özellikle veri seti boyutu arttıkça zaman karmaşıklığı açısından daha avantajlı olduğunu göstermektedir.
CR  - Akogul, S., & Erisoglu, M. (2017). An approach for determining the number of clusters in a model-based cluster analysis. Entropy, 19(9), 452. https://doi.org/10.3390/e19090452
CR  - Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana.
CR  - Calinski, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1), 1-27. https://doi.org/10.1080/03610927408827101
CR  - Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2), 224-227. https://doi.org/10.1109/TPAMI.1979.4766909
CR  - Doan, H., & Nguyen, D. (2018). A method for finding the appropriate number of clusters. The International Arab Journal of Information Technology, 15(4), 675-682.
CR  - Dua, D., & Graff, C. (2019). UCI machine learning repository. Access date: 10.12.2024 https://archive.ics.uci.edu/datasets
CR  - Halkidi, M., & Vazirgiannis, M. (2001, November). Clustering validity assessment: Finding the optimal partitioning of a data set. In Proceedings 2001 IEEE international conference on data mining (pp. 187-194). IEEE. https://doi.org/10.1109/ICDM.2001.989517
CR  - Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1), 100-108. https://doi.org/10.2307/2346830
CR  - Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666. https://doi.org/10.1016/j.patrec.2009.09.011
CR  - Jung, Y., Park, H., Du, D. Z., & Drake, B. L. (2003). A decision criterion for the optimal number of clusters in hierarchical clustering. Journal of Global Optimization, 25(1), 91-111. https://doi.org/10.1023/A:1021394316112
CR  - Karna, A., & Gibert, K. (2022). Automatic identification of the number of clusters in hierarchical clustering. Neural Computing and Applications, 34(1), 119-134. https://doi.org/10.1007/s00521-021-05873-3
CR  - Kaufman, L., & Rousseeuw, P. J. (2009). Finding groups in data: An introduction to cluster analysis. Wiley-Interscience.
CR  - Kothari, R., & Pitts, D. (1999). On finding the number of clusters. Pattern Recognition Letters, 20(4), 405-416. https://doi.org/10.1016/S0167-8655(99)00008-2
CR  - Li, H., Zhang, S., Ding, X., Zhang, C., & Dale, P. (2016). Performance evaluation of cluster validity indices (CVIs) on multi-/hyperspectral remote sensing datasets. Remote Sensing, 8(4), 295. https://doi.org/10.3390/rs8040295
CR  - Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137. https://doi.org/10.1109/TIT.1982.1056489
CR  - MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. University of California Press.
CR  - Pakhira, M. K. (2012). Finding number of clusters before finding clusters. Procedia Technology, 4, 27-37. https://doi.org/10.1016/j.protcy.2012.05.004
CR  - Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65. https://doi.org/10.1016/0377-0427(87)90125-7
CR  - Thorndike, R. L. (1953). Who belongs in the family? Psychometrika, 18(4), 267-276. https://doi.org/10.1007/BF02289263
CR  - Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423. https://doi.org/10.1111/1467-9868.00293
CR  - Vinutha, H. P., Poornima, B., & Sagar, B. M. (2018). Detection of outliers using interquartile range technique from intrusion dataset. In Information and decision sciences: Proceedings of the 6th international conference on FICTA (pp. 511-518). Springer Singapore. https://doi.org/10.1007/978-981-10-7563-6_53
CR  - Wiroonsri, N. (2024). Clustering performance analysis using a new correlation-based cluster validity index. Pattern Recognition, 145, 109910. https://doi.org/10.1016/j.patcog.2023.109910
UR  - https://doi.org/10.53433/yyufbed.1612608
L1  - https://dergipark.org.tr/tr/download/article-file/4489584
ER  - 