A New Method to Measure Clustering Performance and its Evaluation for Text Clustering
Abstract
This study proposes an alternative method for measuring clustering performance. To test the consistency of the proposed method, two data sets of Wikipedia abstracts were clustered with the k-Means, k-Medoids, and CLARANS methods, and performance was measured with both the proposed method and existing methods. The first data set, which contains only English abstracts, was tested by dividing it into different numbers of clusters. Because there was no prior knowledge of the content of the abstracts, the internal methods Silhouette, Calinski-Harabasz, and Davies-Bouldin were used to evaluate how accurately they were clustered. The second data set, which contains Wikipedia abstracts in 6 different languages, was divided into 6 clusters in order to group the abstracts by language. Because the language of each abstract in this data set is known beforehand, clustering success could be measured with both internal and external methods. Data compression algorithms are known to compress a file of similar texts better than a file of dissimilar texts, so compression ratio is proposed as an alternative evaluation metric. The proposed Compression Ratio Index (CRI), which can be calculated much faster than internal methods such as the Silhouette, Calinski-Harabasz, and Davies-Bouldin indexes, was tested with 4 different compression algorithms and yielded the same results as the 9 external methods used on the second data set.
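The compression-based idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the CRI formula, so the definition used here (total compressed bytes divided by total raw bytes, with one concatenated "file" per cluster, lower being better) is an assumption made for illustration, as are the function name `compression_ratio_index`, its parameters, and the synthetic toy data. It uses `zlib` (DEFLATE) from the Python standard library only as a convenient stand-in compressor.

```python
import random
import zlib
from collections import defaultdict


def compression_ratio_index(texts, labels, compress=zlib.compress):
    """Assumed CRI: total compressed bytes / total raw bytes over all clusters.

    The exact formula is not stated in the abstract; here a lower value is
    taken to mean that similar texts ended up in the same cluster.
    """
    clusters = defaultdict(list)
    for text, label in zip(texts, labels):
        clusters[label].append(text)

    raw_total = 0
    packed_total = 0
    for members in clusters.values():
        # One "file" per cluster: its member texts joined and UTF-8 encoded.
        blob = "\n".join(members).encode("utf-8")
        raw_total += len(blob)
        packed_total += len(compress(blob))
    return packed_total / raw_total


# Toy data (illustrative only): two synthetic "languages" that use
# disjoint alphabets, 50 documents of 200 characters each.
rng = random.Random(0)
docs_a = ["".join(rng.choices("abcd", k=200)) for _ in range(50)]
docs_b = ["".join(rng.choices("wxyz", k=200)) for _ in range(50)]
texts = docs_a + docs_b

by_language = [0] * 50 + [1] * 50     # each cluster holds one alphabet
mixed = [i % 2 for i in range(100)]   # each cluster holds both alphabets

print("grouped by language:", compression_ratio_index(texts, by_language))
print("mixed clusters:     ", compression_ratio_index(texts, mixed))
```

Under these assumptions, the partition that groups documents by "language" should compress to a smaller fraction of its raw size than the mixed partition. Other standard-library compressors with the same bytes-in, bytes-out signature, such as `bz2.compress` or `lzma.compress`, can be passed through the `compress` parameter.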
Keywords
References
- Abdalgader, K. (2017). Clustering Short Text using a Centroid-Based Lexical Clustering Algorithm. IAENG International Journal of Computer Science, 44(4).
- Alakuijala, J., & Szabadka, Z. (2016). Brotli Compressed Data Format. Internet Engineering Task Force (IETF), RFC 7932. ISSN: 2070-1721.
- Bolshakova, N., & Azuaje, F. (2003). Cluster validation techniques for genome expression data. Signal Processing, 83(4), 825-833.
- Burrows, M., & Wheeler, D. J. (1994). A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation.
- Caliński, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1), 1-27.
- Cleary, J., & Witten, I. (1984). Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4), 396-402.
- Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, (2), 224-227.
- Deutsch, P. (1996). DEFLATE Compressed Data Format Specification version 1.3. RFC 1951. doi:10.17487/RFC1951.
Details
Primary Language
Turkish
Subjects
Engineering
Journal Section
Research Article
Publication Date
November 30, 2021
Submission Date
May 5, 2021
Acceptance Date
August 15, 2021
Published in Issue
Year 2021 Number: 27
APA
Aslanyürek, M., & Mesut, A. (2021). Kümeleme Performansını Ölçmek için Yeni Bir Yöntem ve Metin Kümeleme için Değerlendirmesi [A New Method to Measure Clustering Performance and its Evaluation for Text Clustering]. Avrupa Bilim Ve Teknoloji Dergisi, 27, 53-65. https://doi.org/10.31590/ejosat.932938
Cited By
Derin Öğrenme Yardımıyla Aktif Termogramlar Üzerinden Meme Lezyonlarının Sınıflandırması
Süleyman Demirel Üniversitesi Fen Edebiyat Fakültesi Fen Dergisi
https://doi.org/10.29233/sdufeffd.1141226
Single and Binary Performance Comparison of Data Compression Algorithms for Text Files
Bitlis Eren Üniversitesi Fen Bilimleri Dergisi
https://doi.org/10.17798/bitlisfen.1301546
Minimizing Delay at Closely Spaced Signalized Intersections Through Green Time Ratio Optimization: A Hybrid Approach With K-Means Clustering and Genetic Algorithms
IEEE Access
https://doi.org/10.1109/ACCESS.2025.3549970
Eğitim Fakülteleri Lisans Programlarındaki Değişimlerin YÖK ATLAS’taki Çeşitli Girdi Göstergeleri (2022-2024) Üzerinden İncelenmesi
Yaşadıkça Eğitim
https://doi.org/10.33308/26674874.2025392894
Üniversite Öğrencilerinin Eleştirel Düşünme Profillerinin k-Means Kümeleme Algoritması ile Analizi
International Journal of Pure and Applied Sciences
https://doi.org/10.29132/ijpas.1675646
G7 Ülkelerinin Ekonomik ve Sosyal Dinamikleri: Eğitim, Sağlık ve Ar-Ge Harcamalarının K-Means Algoritması ile İncelenmesi
Adnan Menderes Üniversitesi Sosyal Bilimler Enstitüsü Dergisi
https://doi.org/10.30803/adusobed.1679587
Tekstil üretim süreçlerinde hata tahmini ve önlenmesi: Makine öğrenmesi tabanlı karar destek sistemi yaklaşımı
Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi
https://doi.org/10.65206/pajes.10594