An Investigation On The Execution Of The Document Clustering Process On Internet News

Metin Oktay Boz; Jale Bektaş

Report

Turkish title: İnternet Haberlerinde Belge Kümeleme Sürecinin Uygulanmasına Yönelik Bir Araştırma

Year 2024, Volume: 04, Issue: 02, pp. 113-119, 31 December 2024

Project Number

icesconf-00370040

Abstract

Numerous studies have addressed the recognition of internet news as valid documents. This study applies text mining techniques to build a TF-IDF matrix and then automatically determines the optimal number of clusters and categorizes the documents accordingly. It examines the effect of the K-Means document clustering algorithm on internet news articles, using the User Engagement dataset, which contains articles from various reputable publishers. Before K-Means was applied, several preprocessing steps were carried out to prepare the TF-IDF matrix. Because content attribute data were unavailable, the description attribute was used for document clustering. During preprocessing, extraneous ASCII symbols, punctuation marks, line breaks, e-mail addresses, mentions, internet extensions, stopwords, and words outside the 2 to 21 character range were removed. Words were stemmed to consolidate different forms of the same root. The Elbow method was applied to the TF-IDF matrix to determine the optimal number of clusters, and the results were then analyzed using the most prominent words and word clouds. Ultimately, five clusters containing 797, 408, 89, 364, and 8755 documents were identified.
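
The preprocessing and TF-IDF steps summarized in the abstract can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the stopword list, the `crude_stem` suffix stripper (a stand-in for a real stemmer), the reading of "internet extensions" as URLs, the inclusive 2 to 21 length bounds, and the textbook weighting tf * log(N / df) are all assumptions made for illustration. Library implementations may use smoothed or normalized variants of the weighting.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; the study's list is not given.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
             "for", "is", "are", "as", "at", "by", "with"}

def crude_stem(word):
    """Very rough suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Clean one description string and return a list of stemmed tokens."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)                # e-mail addresses
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs (assumed meaning of "internet extensions")
    text = re.sub(r"@\w+", " ", text)                   # mentions
    text = re.sub(r"[^a-z\s]", " ", text)               # symbols, digits, punctuation
    tokens = text.split()                               # splitting also discards line breaks
    # Keep non-stopwords whose length is within 2..21 characters (assumed inclusive).
    tokens = [t for t in tokens if t not in STOPWORDS and 2 <= len(t) <= 21]
    return [crude_stem(t) for t in tokens]

def tfidf(docs_tokens):
    """Return (vocabulary, matrix) with one row of tf * log(N / df) weights per document."""
    n_docs = len(docs_tokens)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    vocab = sorted(df)
    rows = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        rows.append([(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows

# Usage with made-up descriptions (not taken from the dataset).
descriptions = [
    "Markets rallied as investors cheered the earnings report.",
    "Fans cheered as the team won, see https://example.com/match",
]
vocab, matrix = tfidf([preprocess(d) for d in descriptions])
```

Terms that occur in every document receive a weight of zero under this weighting, which is why frequent but uninformative words contribute little to the clustering step.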

Keywords

K-Means, TF-IDF, Clustering, Document Clustering
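
As a rough illustration of how K-Means and the Elbow method fit together, the following standard-library sketch runs K-Means for several values of k and picks the k at which the inertia curve bends most sharply. It is a sketch under stated assumptions, not the procedure used in the study: the deterministic farthest-point seeding, the second-difference rule in `elbow_k`, and the toy 2-D points (standing in for rows of a TF-IDF matrix) are all hypothetical choices.

```python
from collections import Counter

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Assumed seeding: start at points[0], then repeatedly add the point
    farthest from the centroids chosen so far (deterministic)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(points, k, n_iter=50):
    """Lloyd-style K-Means; returns (labels, centroids, inertia)."""
    centroids = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia

def inertia_curve(points, k_max):
    """Within-cluster sum of squares for k = 1..k_max."""
    return [kmeans(points, k)[2] for k in range(1, k_max + 1)]

def elbow_k(inertias):
    """Assumed heuristic: k with the largest second difference of the curve.
    inertias[i] belongs to k = i + 1; at least three values are required."""
    best_i = max(range(1, len(inertias) - 1),
                 key=lambda i: inertias[i - 1] - 2 * inertias[i] + inertias[i + 1])
    return best_i + 1

# Toy data: three well-separated groups of three points each.
points = [[0, 0], [0, 1], [1, 0],
          [10, 10], [10, 11], [11, 10],
          [20, 0], [20, 1], [21, 0]]
inertias = inertia_curve(points, 5)
best_k = elbow_k(inertias)
labels, centroids, inertia = kmeans(points, best_k)
cluster_sizes = sorted(Counter(labels).values())
```

On real TF-IDF rows the elbow is often less pronounced than on this toy data, so the inertia curve is usually also inspected visually rather than relying on a single heuristic.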

Details

Primary Language	English
Subjects	Computer Software
Section	Research Article
Authors	Metin Oktay Boz (ORCID: 0009-0006-8620-7775), Jale Bektaş (ORCID: 0000-0002-8793-1486)
Project Number	icesconf-00370040
Publication Date	31 December 2024
Submission Date	27 June 2024
Acceptance Date	18 July 2024
Published in Issue	Year 2024, Volume: 04, Issue: 02

Cite

IEEE	M. O. Boz and J. Bektaş, “An Investigation On The Execution Of The Document Clustering Process On Internet News”, Researcher, vol. 04, no. 02, pp. 113–119, 2024.
