An Investigation On The Execution Of The Document Clustering Process On Internet News

Metin Oktay Boz; Jale Bektaş

Report

Turkish title: İnternet Haberlerinde Belge Kümeleme Sürecinin Uygulanmasına Yönelik Bir Araştırma

Year 2024, Volume: 04 Issue: 02, 113 - 119, 31.12.2024

Project Number

icesconf-00370040

Abstract

Numerous studies have addressed the recognition of Internet news as valid documents. This study applies text mining techniques to build a TF-IDF matrix and then automatically determines the optimal number of clusters and categorizes the documents accordingly. It examines the effect of K-Means document clustering on Internet news articles using the User Engagement dataset, which contains articles from various reputable publishers. Before the K-Means algorithm was applied, several preprocessing steps were carried out to prepare the TF-IDF matrix. Because content attribute data were unavailable, the description attribute was used for document clustering. During preprocessing, extraneous ASCII symbols, punctuation marks, line breaks, e-mail addresses, mentions, internet extensions, stopwords, and words outside the 2 to 21 character range were removed. Words were then stemmed so that different forms of the same root were merged. The Elbow method was applied to the TF-IDF matrix to determine the optimal number of clusters, and the results were analyzed using the most prominent words and word clouds. Ultimately, five clusters were identified, containing 797, 408, 89, 364, and 8755 documents.
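The pipeline summarized in the abstract (cleaning, TF-IDF weighting, K-Means, Elbow method) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The stopword list, the regular expressions, the smoothed IDF formula, and the function names `preprocess`, `tfidf`, `kmeans`, and `elbow_curve` are all assumptions made for illustration, and stemming is omitted because the standard library has no stemmer.

```python
import math
import random
import re
from collections import Counter

# Hypothetical stopword list; the abstract does not say which list was used.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for", "is", "at"}

def preprocess(text):
    """Clean one description string and return its tokens (no stemming here)."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)               # e-mail addresses
    text = re.sub(r"@\w+", " ", text)                  # mentions
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[^a-z\s]", " ", text)              # punctuation, digits, other symbols
    tokens = text.split()                               # split() also absorbs line breaks
    return [t for t in tokens if 2 <= len(t) <= 21 and t not in STOPWORDS]

def tfidf(docs):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    df = Counter(t for d in docs for t in set(d))       # document frequency per term
    n = len(docs)
    rows = []
    for d in docs:
        vec = [0.0] * len(vocab)
        for t, count in Counter(d).items():
            # Smoothed IDF variant (an assumption, not taken from the paper).
            vec[index[t]] = count * (math.log((1 + n) / (1 + df[t])) + 1)
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        rows.append([v / norm for v in vec])
    return vocab, rows

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=20, seed=0):
    """Basic K-Means with random initial centres; returns (labels, inertia)."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: each document goes to its nearest centre.
        labels = [min(range(k), key=lambda c: sqdist(r, centers[c])) for r in rows]
        # Update step: each centre becomes the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(sqdist(r, centers[lab]) for r, lab in zip(rows, labels))
    return labels, inertia

def elbow_curve(rows, k_values):
    """Within-cluster sum of squares for each k; the 'elbow' is where it flattens."""
    return {k: kmeans(rows, k)[1] for k in k_values}
```

In use, each article description would be passed through `preprocess`, the token lists through `tfidf`, and `elbow_curve(rows, range(1, 11))` would be inspected for the value of k after which the within-cluster sum of squares stops dropping sharply. On a corpus of the size reported in the abstract, a sparse-matrix library would be a more practical choice than these dense Python lists.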

Keywords

K-Means, TF-IDF, Clustering, Document Clustering

Details

Primary Language	English
Subjects	Computer Software
Journal Section	Research Article
Authors	Metin Oktay Boz (ORCID: 0009-0006-8620-7775), Jale Bektaş (ORCID: 0000-0002-8793-1486)
Project Number	icesconf-00370040
Publication Date	December 31, 2024
Submission Date	June 27, 2024
Acceptance Date	July 18, 2024
Published in Issue	Year 2024 Volume: 04 Issue: 02

Cite

IEEE	M. O. Boz and J. Bektaş, “An Investigation On The Execution Of The Document Clustering Process On Internet News”, Researcher, vol. 04, no. 02, pp. 113–119, 2024.

The journal "Researcher: Social Sciences Studies" (RSSS), which began publication in 2013, has continued under the name "Researcher" since August 2020, under Ankara Bilim University.
It is an internationally indexed, nationally refereed, scientific electronic journal that, from 2021 onward, publishes original research articles aiming to contribute to the fields of Engineering and Science.
The journal is published twice a year, except for special issues.
Candidate articles submitted for publication in the journal may be written in Turkish or English. Articles submitted to the journal must not have been previously published in, or submitted for publication to, another journal.