ACADEMIC TEXT CLUSTERING USING NATURAL LANGUAGE PROCESSING

Salimkan Fatma Taşkıran; Ersin Kaya

doi:10.36306/konjes.1081213

Research Article

Year 2022, Volume: 10, pp. 41-51, 16.12.2022

Abstract

Data is easy to access today. To use it efficiently, however, the right information must be extracted from it. Categorizing data makes it possible to reach the needed information quickly. Academic research typically relies on text-based data such as articles, conference papers, and theses. Natural language processing and machine learning methods are used to extract the needed information from such text. In this study, abstracts of academic papers are clustered. The text of the abstracts is preprocessed using natural language processing techniques. Vector representations are then extracted from the preprocessed text with Word2Vec and BERT embeddings, and these representations are clustered with four clustering algorithms.
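The pipeline described above (preprocess, vectorize, cluster) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The tiny `TOY_VECTORS` table is a hypothetical stand-in for pretrained Word2Vec or BERT embeddings, the stopword list and sample abstracts are invented for illustration, and plain k-means is assumed as one possible clustering algorithm, since the abstract does not name the four that were used.

```python
import math
import re

# Hypothetical 2-D vectors standing in for pretrained Word2Vec/BERT embeddings.
TOY_VECTORS = {
    "neural": (1.0, 0.0), "network": (0.9, 0.1), "learning": (0.8, 0.2),
    "protein": (0.0, 1.0), "cell": (0.1, 0.9), "gene": (0.2, 0.8),
}
# Illustrative stopword list (an assumption, not taken from the article).
STOPWORDS = {"a", "an", "the", "of", "in", "for", "and", "with"}


def preprocess(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def embed(tokens, dim=2):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return (0.0,) * dim
    return tuple(sum(v[i] for v in known) / len(known) for i in range(dim))


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as deterministic initial centroids."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


# Invented sample abstracts, two per topic.
ABSTRACTS = [
    "A neural network for learning",
    "Gene expression in the cell",
    "Learning with a deep neural network",
    "Protein and gene analysis of the cell",
]

vectors = [embed(preprocess(a)) for a in ABSTRACTS]
labels = kmeans(vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

In a realistic setting the toy table would be replaced by a trained embedding model, and the averaging step is only one common way to turn word vectors into a document vector.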

Keywords

Natural Language Processing, Machine Learning, Text Representation

Details

Primary Language	English
Subjects	Engineering
Journal Section	Research Article
Authors	Salimkan Fatma Taşkıran (ORCID 0000-0002-8290-6759); Ersin Kaya (ORCID 0000-0001-5668-5078)
Publication Date	December 16, 2022
Submission Date	March 1, 2022
Acceptance Date	November 16, 2022
Published in Issue	Year 2022 Volume: 10

Cite

IEEE	S. F. Taşkıran and E. Kaya, “ACADEMIC TEXT CLUSTERING USING NATURAL LANGUAGE PROCESSING”, KONJES, vol. 10, pp. 41–51, 2022, doi: 10.36306/konjes.1081213.
