Research Article

Text Clustering with Pre-Trained Models: BERT, RoBERTa, ALBERT and MPNet

Year 2024, Volume: 5 Issue: 2, 37 - 46, 30.12.2024
https://doi.org/10.46572/naturengs.1577517

Abstract

Text clustering is the process of grouping similar sentences or texts of variable length into the same cluster. Text clustering methods are an important tool for analyzing text data and extracting information from it, and many studies have addressed this task with different approaches and methods. In this study, text representations produced by the pre-trained models BERT (Bidirectional Encoder Representations from Transformers), RoBERTa (Robustly Optimized BERT Pretraining Approach), ALBERT (A Lite BERT) and MPNet (Masked and Permuted Pre-training for Language Understanding) were compared with the traditional statistical feature extraction method TF-IDF (Term Frequency-Inverse Document Frequency). After the feature extraction stage, the representations were clustered with the K-means, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), Agglomerative Clustering and Mini-batch K-means algorithms, and their performance was measured. The evaluation of these measurements shows that the pre-trained models yield better clustering results than the classical method.
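
As an illustration of the pipeline described in the abstract, the sketch below represents a handful of documents either with TF-IDF or with a pre-trained sentence encoder and then applies the four clustering algorithms. It is a minimal sketch, not the authors' code: it assumes the third-party scikit-learn and sentence-transformers packages; the model name "all-mpnet-base-v2", the toy documents, and the silhouette score are illustrative choices, and the study's own dataset and evaluation measures may differ.

    # Minimal sketch (assumed libraries: scikit-learn, sentence-transformers).
    # Documents are represented either with TF-IDF or with a pre-trained
    # sentence-embedding model, then clustered with the four algorithms named
    # in the abstract; the silhouette score is used here only as an
    # illustrative quality measure, not necessarily the paper's metric.
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans, MiniBatchKMeans, Birch, AgglomerativeClustering
    from sklearn.metrics import silhouette_score

    documents = [
        "Stocks rallied after the quarterly earnings report.",
        "The central bank signaled another interest rate hike.",
        "The striker scored twice in the cup final.",
        "The home team won the match in extra time.",
    ]
    n_clusters = 2

    # Traditional statistical representation: TF-IDF term weights.
    tfidf_features = TfidfVectorizer(stop_words="english").fit_transform(documents).toarray()

    # Pre-trained representation: dense sentence embeddings (an MPNet-based model
    # is used as an example; BERT, RoBERTa or ALBERT encoders are used the same way).
    encoder = SentenceTransformer("all-mpnet-base-v2")
    embedding_features = encoder.encode(documents)

    # The four clustering algorithms compared in the study.
    algorithms = {
        "K-means": KMeans(n_clusters=n_clusters, n_init=10, random_state=42),
        "Mini-batch K-means": MiniBatchKMeans(n_clusters=n_clusters, n_init=10, random_state=42),
        "BIRCH": Birch(n_clusters=n_clusters),
        "Agglomerative": AgglomerativeClustering(n_clusters=n_clusters),
    }

    # Cluster each representation with each algorithm and report a quality score.
    for rep_name, features in [("TF-IDF", tfidf_features), ("MPNet", embedding_features)]:
        for alg_name, algorithm in algorithms.items():
            labels = algorithm.fit_predict(features)
            print(f"{rep_name:7s} + {alg_name:18s} silhouette = {silhouette_score(features, labels):.3f}")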

References

  • Ahmed, A., Boyce, E., & Pfeffer, J. (2007). The structure of online discussion groups: A case study. Management Science (pp. 1432-1445).
  • Aggarwal, C. C., & Zhai, C. (2012). A survey of text clustering algorithms. In Mining Text Data (pp. 77–128). New York, London: Springer.
  • Ashley, K. D. (2017). Artificial Intelligence and Legal Analytics: New Tools for Law Practice in the Digital Age. Cambridge University Press. https://doi.org/10.1017/9781316761380
  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research (pp. 993-1022).
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning (pp. 128-129). New York, USA: Springer.
  • Boult, T., DeRose, T., Czerwinski, M., & Smith, B. (2003). A comparison of clustering algorithms for gene expression data. Pacific Symposium on Biocomputing (pp. 535-546).
  • Caruana, G., & Li, M. (2012). A survey of emerging approaches to spam filtering. ACM Computing Surveys, 44(2), 1–27. https://doi.org/10.1145/2089125.2089129
  • Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. URL: https://ar5iv.labs.arxiv.org/html/1705.02364 (accessed date: March 23, 2023).
  • Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science (pp. 391-407).
  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. URL: https://arxiv.org/abs/1810.04805 (accessed date: May 12, 2023).
  • Dhillon, I., Mallela, S., & Modha, D. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning (pp. 143-175).
  • Elkan, C. (2003). Using the triangle inequality to accelerate k-means. In Proceedings of the 20th international conference on Machine learning (ICML-03) (pp. 147-153).
  • Fahim, M. (2021). BERT - In depth understanding. URL: https://www.kaggle.com/code/mdfahimreshm/bert-in-depth-understanding (accessed date: January 17, 2023).
  • Guan, R., Zhang, H., Liang, Y., Giunchiglia, F., Huang, L., & Feng, X. (2020). Deep feature-based text clustering and its explanation. IEEE Transactions on Knowledge and Data Engineering. URL: https://ieeexplore.ieee.org/document/9215004 (accessed date: November 10, 2023).
  • Gulli, A. (2015). AG's corpus of news articles. URL: http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html (accessed date: December 30, 2022).
  • Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. URL: https://arxiv.org/abs/1801.06146 (accessed date: April 08, 2023).
  • Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666.
  • Kodinariya, T. M., & Makwana, P. R. (2013). Review on determining number of Cluster in K-Means Clustering. International Journal, 1(6), 90-95.
  • Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2020). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. URL: https://arxiv.org/abs/1909.11942 (accessed date: December 22, 2022).
  • Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In International conference on machine learning (pp. 1188-1196).
  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. URL: https://arxiv.org/abs/1907.11692 (accessed date: January 17, 2023).
  • Liu, B. (2012). Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers. https://doi.org/10.2200/S00416ED1V01Y201204HLT016
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
  • Mutlu, H. B., Durmaz, F., Yücel, N., Cengil, E., & Yıldırım, M. (2023). Prediction of maternal health risk with traditional machine learning methods. Naturengs, 4(1), 16-23.
  • Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2018). Language models are unsupervised multitask learners. OpenAI. URL: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (accessed date: September 17, 2022).
  • Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. URL: https://arxiv.org/abs/1908.10084 (accessed date: February 03, 2023).
  • Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management (pp. 513-523).
  • Sidorov, G. (2014). Term frequency–inverse document frequency (TF-IDF) as a tool of content-based recommendation systems. International Journal of Computer Science and Information Security, 12(1), 44-51.
  • Song, K., Tan, X., Qin, T., Lu, J., & Liu, T. Y. (2020). MPNet: Masked and permuted pre-training for language understanding. URL: https://arxiv.org/pdf/2004.09297 (accessed date: June 18, 2023).
  • Subakti, A., Murfi, H. & Hariadi, N. (2022). The performance of BERT as data representation of text clustering. URL: https://doi.org/10.1186/s40537-022-00564-9 (accessed date: March 25, 2023).
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., & Polosukhin, I. (2017). Attention is all you need. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (accessed date: December 17, 2022).
  • Wang, A. L., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2020). GLUE benchmark: Evaluating large-scale language understanding systems. URL: https://arxiv.org/abs/1804.07461 (accessed date: October 02, 2022).
  • Xu, R., Wunsch, D., & Hu, J. (2015). Survey of clustering algorithms. IEEE Transactions on Neural Networks and Learning Systems, 26(11), 2264-2281.
  • Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 103-114).
  • Zhang, Y., Zha, H., & Lai, J. (2002). Text clustering based on the latent Dirichlet allocation model. Advances in Neural Information Processing Systems (pp. 1049-1056).
  • Zhang, Y., Wallace, B. C., & Li, M. (2015). A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. URL: https://arxiv.org/abs/1510.03820 (accessed date: February 19, 2023).
  • Zhao, Y., Chen, W., & Liu, X. (2019). A survey of clustering algorithms for text data. ACM Computing Surveys, 52(5), 1-45.
There are 38 citations in total.

Details

Primary Language English
Subjects Computer Software
Journal Section Research Articles
Authors

Oğuzhan Alagöz 0000-0002-7089-3196

Taner Uçkan 0000-0001-5385-6775

Publication Date December 30, 2024
Submission Date November 1, 2024
Acceptance Date December 6, 2024
Published in Issue Year 2024 Volume: 5 Issue: 2

Cite

APA Alagöz, O., & Uçkan, T. (2024). Text Clustering with Pre-Trained Models: BERT, RoBERTa, ALBERT and MPNet. NATURENGS, 5(2), 37-46. https://doi.org/10.46572/naturengs.1577517