Research Article

A New Method to Measure Clustering Performance and its Evaluation for Text Clustering

Number: 27, November 30, 2021

Abstract

This study proposes an alternative method for measuring clustering performance. To test the consistency of the proposed method, two data sets of Wikipedia abstracts were clustered with the k-Means, k-Medoids, and CLARANS methods, and performance was measured with both the proposed method and existing methods. The first data set, which contains only English abstracts, was tested by dividing it into different numbers of clusters. Because there was no prior knowledge of the content of the abstracts, the internal methods Silhouette, Calinski-Harabasz, and Davies-Bouldin were used to evaluate how accurately they were clustered. The second data set, which contains Wikipedia abstracts in 6 different languages, was divided into 6 clusters in order to group the abstracts by language. Because the language of each abstract in this data set is known beforehand, the success of clustering could be measured with both internal and external methods. Data compression algorithms are known to compress a file of similar texts better than a file of dissimilar texts, so the compression ratio is proposed as an alternative evaluation metric. The proposed Compression Ratio Index (CRI), which can be calculated much faster than internal indexes such as Silhouette, Calinski-Harabasz, and Davies-Bouldin, was tested with 4 different compression algorithms and yielded the same results as the 9 external methods used on the second data set.
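The idea behind a compression-based score can be illustrated with a minimal Python sketch. The abstract does not give the exact CRI formula, so the definition used here (total compressed size of the per-cluster files divided by their total uncompressed size, lower being better) is an assumption, as are the function name `compression_ratio_index`, the toy English and Turkish sentences, and the choice of the standard-library compressors `zlib`, `bz2`, and `lzma`, which are not necessarily the 4 algorithms tested in the study.

```python
import bz2
import lzma
import zlib


def compression_ratio_index(texts, labels, compress=zlib.compress):
    """Assumed illustrative score, not the paper's exact formula.

    Each cluster's texts are concatenated into one blob and compressed.
    The score is total compressed bytes / total raw bytes, so a lower
    value means the members of each cluster are more similar.
    """
    clusters = {}
    for text, label in zip(texts, labels):
        clusters.setdefault(label, []).append(text)

    raw_size = 0
    compressed_size = 0
    for members in clusters.values():
        blob = "\n".join(members).encode("utf-8")
        raw_size += len(blob)
        compressed_size += len(compress(blob))
    return compressed_size / raw_size


def rotations(sentence, count):
    """Build `count` toy documents by rotating the words of one sentence."""
    words = sentence.split()
    return [" ".join(words[i:] + words[:i]) for i in range(count)]


# Toy corpus (hypothetical data): 6 English and 6 Turkish documents.
english = rotations(
    "the river flows through the old town and people walk along the water every morning", 6
)
turkish = rotations(
    "nehir eski kasabanın içinden akar ve insanlar her sabah su kenarında yürür", 6
)
texts = english + turkish

# A clustering that separates the languages, and one that mixes them evenly.
good_labels = [0] * 6 + [1] * 6
mixed_labels = [0, 1] * 6

for name, compress in [("zlib", zlib.compress), ("bz2", bz2.compress), ("lzma", lzma.compress)]:
    good = compression_ratio_index(texts, good_labels, compress)
    mixed = compression_ratio_index(texts, mixed_labels, compress)
    print(f"{name}: by-language={good:.3f} mixed={mixed:.3f}")
```

Because the score only requires concatenating and compressing each cluster once, its cost grows with the amount of text rather than with the number of pairwise distances, which is in line with the speed advantage the abstract reports over internal indexes. On such a tiny corpus the fixed overhead of each compressor is large, so the absolute values are not meaningful; only the comparison between the two labelings is.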

References

  1. Abdalgader, K. (2017). Clustering short text using a centroid-based lexical clustering algorithm. IAENG International Journal of Computer Science, 44(4).
  2. Alakuijala, J., & Szabadka, Z. (2016). Brotli compressed data format. Internet Engineering Task Force (IETF), RFC 7932. ISSN: 2070-1721.
  3. Bolshakova, N., & Azuaje, F. (2003). Cluster validation techniques for genome expression data. Signal Processing, 83(4), 825-833.
  4. Burrows, M., & Wheeler, D. J. (1994). A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation.
  5. Caliński, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1), 1-27.
  6. Cleary, J., & Witten, I. (1984). Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4), 396-402.
  7. Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2), 224-227.
  8. Deutsch, P. (1996). DEFLATE compressed data format specification version 1.3. RFC 1951. doi:10.17487/RFC1951.

Details

Primary Language

Turkish

Subjects

Engineering

Journal Section

Research Article

Publication Date

November 30, 2021

Submission Date

May 5, 2021

Acceptance Date

August 15, 2021

Published in Issue

Year 2021 Number: 27

APA
Aslanyürek, M., & Mesut, A. (2021). Kümeleme Performansını Ölçmek için Yeni Bir Yöntem ve Metin Kümeleme için Değerlendirmesi [A New Method to Measure Clustering Performance and its Evaluation for Text Clustering]. Avrupa Bilim Ve Teknoloji Dergisi, 27, 53-65. https://doi.org/10.31590/ejosat.932938
