Research Article

Text Clustering with Pre-Trained Models: BERT, RoBERTa, ALBERT and MPNet

Volume: 5 Number: 2 December 30, 2024
Abstract

Text clustering is the process of grouping similar sentences from texts of varying length. Text clustering methods form an important area of data analysis and information extraction, and many studies have approached the problem with different methods. In this study, the text representations produced by the pre-trained models BERT (Bidirectional Encoder Representations from Transformers), RoBERTa (Robustly Optimized BERT Pretraining Approach), ALBERT (A Lite BERT) and MPNet (Masked and Permuted Pre-training for Language Understanding) were compared with the traditional statistical feature-extraction method TF-IDF (Term Frequency-Inverse Document Frequency). After the feature-extraction stage, performance was measured by clustering with the K-means, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), Agglomerative clustering and Mini-batch K-means algorithms. The evaluation shows that the pre-trained models yield superior clustering results compared to the classical approach.
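The pipeline described above (extract features, then cluster, then score) can be sketched as follows. This is a minimal illustration using scikit-learn, not the authors' code: the TF-IDF baseline is run directly, while the pre-trained-model branch is shown only as a comment (the model name `all-mpnet-base-v2` is an assumed example from the sentence-transformers library, since downloading weights is outside the scope of a sketch). The toy documents and the choice of silhouette score as the metric are likewise illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, Birch, AgglomerativeClustering, MiniBatchKMeans
from sklearn.metrics import silhouette_score

# Toy corpus: two loose topics (pets vs. finance).
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell sharply today",
    "the market closed lower",
]

# Feature extraction: TF-IDF baseline (densified so BIRCH and
# Agglomerative clustering accept the input).
X = TfidfVectorizer().fit_transform(docs).toarray()

# Pre-trained alternative (assumed usage, not executed here):
# from sentence_transformers import SentenceTransformer
# X = SentenceTransformer("all-mpnet-base-v2").encode(docs)

# Cluster with the four algorithms compared in the study and
# score each labeling with the silhouette coefficient.
results = {}
for name, algo in {
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "MiniBatchKMeans": MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0),
    "Birch": Birch(n_clusters=2),
    "Agglomerative": AgglomerativeClustering(n_clusters=2),
}.items():
    labels = algo.fit_predict(X)
    results[name] = silhouette_score(X, labels)
```

Swapping the TF-IDF matrix for sentence embeddings leaves the clustering and scoring code unchanged, which is what makes this kind of representation comparison straightforward.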

Details

Primary Language

English

Subjects

Computer Software

Journal Section

Research Article

Publication Date

December 30, 2024

Submission Date

November 1, 2024

Acceptance Date

December 6, 2024

Published in Issue

Year 2024 Volume: 5 Number: 2

APA
Alagöz, O., & Uçkan, T. (2024). Text Clustering with Pre-Trained Models: BERT, RoBERTa, ALBERT and MPNet. NATURENGS, 5(2), 37-46. https://doi.org/10.46572/naturengs.1577517