TY  - JOUR
T1  - A New Method for Determining the Number of Clusters Without Clustering
TT  - Kümeleme Yapmadan Küme Sayısını Belirlemek için Yeni Bir Yöntem
AU  - Turan, Duygu Selin
PY  - 2025
DA  - August
Y2  - 2025
DO  - 10.53433/yyufbed.1612608
JF  - Yüzüncü Yıl Üniversitesi Fen Bilimleri Enstitüsü Dergisi
JO  - YYU JINAS
PB  - Van Yüzüncü Yıl Üniversitesi
WT  - DergiPark
SN  - 1300-5413
SP  - 596
EP  - 607
VL  - 30
IS  - 2
LA  - en
AB  - Clustering methods are essential for identifying patterns in data, and the number of clusters significantly impacts the quality of results. Determining the optimal number of clusters is challenging, particularly for large datasets, as traditional methods can be computationally expensive. Developing efficient techniques to determine the number of clusters is crucial for improving both the accuracy and scalability of clustering, especially in large-scale applications. In this study, a new approach for determining the number of clusters is presented. The proposed method aims to find the number of clusters based solely on the distances between data points, without performing clustering. Similar to the Elbow method, the elbow point is found for the distances between data points, and the number of clusters is determined using this elbow point. The proposed algorithm was compared with the Elbow method using 11 real-world datasets and 4 performance metrics. The results demonstrate that the proposed method is particularly advantageous in terms of time complexity, especially as the dataset size increases.
KW  - Clustering
KW  - Comparative analysis
KW  - Cluster validity index
N2  - Kümeleme yöntemleri, verilerdeki örüntüleri belirlemek için önemlidir ve küme sayısı, sonuçların kalitesini önemli ölçüde etkiler. Optimal küme sayısının belirlenmesi, özellikle büyük veri setlerinde zorlu bir görevdir, çünkü geleneksel yöntemler hesaplama açısından maliyetli olabilir. Küme sayısını belirlemek için verimli tekniklerin geliştirilmesi, özellikle büyük ölçekli uygulamalarda, kümelemenin doğruluğu ve ölçeklenebilirliğini iyileştirmek için kritik öneme sahiptir. Bu çalışmada, küme sayısını belirlemek için yeni bir yaklaşım sunulmuştur. Önerilen yöntem, kümeleme yapmadan sadece veri noktaları arası uzaklıkları temel alarak küme sayısını bulmayı amaçlar. Dirsek yöntemine benzer şekilde, veri noktaları arası uzaklıklar için dirsek noktası bulunup, bu dirsek noktası yardımıyla küme sayısı belirlenir. Önerilen algoritma, Dirsek yöntemiyle 11 veri seti ve 4 performans metriği kullanılarak karşılaştırılmıştır. Sonuçlar, önerilen yöntemin özellikle veri seti boyutu arttıkça zaman karmaşıklığı açısından daha avantajlı olduğunu göstermektedir.
CR  - Akogul, S., & Erisoglu, M. (2017). An approach for determining the number of clusters in a model-based cluster analysis. Entropy, 19(9), 452. https://doi.org/10.3390/e19090452
CR  - Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, in New Orleans, Louisiana.
CR  - Calinski, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1), 1-27. https://doi.org/10.1080/03610927408827101
CR  - Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2), 224-227. https://doi.org/10.1109/TPAMI.1979.4766909
CR  - Doan, H., & Nguyen, D. (2018). A method for finding the appropriate number of clusters. Int. Arab J. Inf. Technol., 15(4), 675-682.
CR  - Dua, D., & Graff, C. (2019). UCI machine learning repository. Access date: 10.12.2024 https://archive.ics.uci.edu/datasets
CR  - Halkidi, M., & Vazirgiannis, M. (2001, November). Clustering validity assessment: Finding the optimal partitioning of a data set. In Proceedings 2001 IEEE international conference on data mining (pp. 187-194). IEEE. https://doi.org/10.1109/ICDM.2001.989517
CR  - Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1), 100-108. https://doi.org/10.2307/2346830
CR  - Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666. https://doi.org/10.1016/j.patrec.2009.09.011
CR  - Jung, Y., Park, H., Du, D. Z., & Drake, B. L. (2003). A decision criterion for the optimal number of clusters in hierarchical clustering. Journal of Global Optimization, 25(1), 91-111. https://doi.org/10.1023/A:1021394316112
CR  - Karna, A., & Gibert, K. (2022). Automatic identification of the number of clusters in hierarchical clustering. Neural Computing and Applications, 34(1), 119-134. https://doi.org/10.1007/s00521-021-05873-3
CR  - Kaufman, L., & Rousseeuw, P. J. (2009). Finding groups in data: An introduction to cluster analysis. Wiley-Interscience.
CR  - Kothari, R., & Pitts, D. (1999). On finding the number of clusters. Pattern Recognition Letters, 20(4), 405-416. https://doi.org/10.1016/S0167-8655(99)00008-2
CR  - Li, H., Zhang, S., Ding, X., Zhang, C., & Dale, P. (2016). Performance evaluation of cluster validity indices (CVIs) on multi-/hyperspectral remote sensing datasets. Remote Sensing, 8(4), 295. https://doi.org/10.3390/rs8040295
CR  - Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137. https://doi.org/10.1109/TIT.1982.1056489
CR  - MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. University of California Press.
CR  - Pakhira, M. K. (2012). Finding number of clusters before finding clusters. Procedia Technology, 4, 27-37. https://doi.org/10.1016/j.protcy.2012.05.004
CR  - Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65. https://doi.org/10.1016/0377-0427(87)90125-7
CR  - Thorndike, R. L. (1953). Who belongs in the family? Psychometrika, 18(4), 267-276. https://doi.org/10.1007/BF02289263
CR  - Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423. https://doi.org/10.1111/1467-9868.00293
CR  - Vinutha, H. P., Poornima, B., & Sagar, B. M. (2018). Detection of outliers using interquartile range technique from intrusion dataset. In Information and decision sciences: Proceedings of the 6th international conference on FICTA (pp. 511-518). Springer Singapore. https://doi.org/10.1007/978-981-10-7563-6_53
CR  - Wiroonsri, N. (2024). Clustering performance analysis using a new correlation-based cluster validity index. Pattern Recognition, 145, 109910. https://doi.org/10.1016/j.patcog.2023.109910
UR  - https://doi.org/10.53433/yyufbed.1612608
L1  - https://dergipark.org.tr/tr/download/article-file/4489584
ER  -