Review Article

Big data reduction and visualization using the K-means algorithm

Year 2022, Volume: 02 Issue: 01, 40 - 45, 31.07.2022

Abstract

A huge amount of data is produced every day. Beyond the need for high-performance processing, efficiently visualizing data at this scale (up to terabytes) remains a major challenge. In this study, we use the well-known K-means clustering method as a data reduction strategy that preserves the visual quality of large datasets as far as possible. The cluster centroids stand in for the full dataset and display its distribution properties in a straightforward manner. Our data comes from a recent Kaggle big data set (Click-Through Rate); we draw box plots on the reduced datasets and compare them with the plots of the original data. We find that K-means is an effective strategy for reducing large data so that it can be viewed without sacrificing the quality of its distribution information.
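The reduction idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes plain Lloyd's K-means written with the standard library only, synthetic Gaussian data standing in for the Kaggle Click-Through Rate set, and hypothetical helper names (`kmeans`, `five_number_summary`). The five-number summary is the set of statistics a box plot draws, so printing it for the original points and for the centroids gives a rough, text-only version of the comparison described above.

```python
import random
import statistics


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's K-means on a list of equal-length tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable, so stop early
        centroids = new_centroids
    return centroids


def five_number_summary(values):
    """Minimum, quartiles and maximum: the statistics a box plot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


# Synthetic stand-in data (an assumption; the study uses a Kaggle CTR dataset).
data_rng = random.Random(42)
data = [(data_rng.gauss(50.0, 10.0), data_rng.gauss(5.0, 2.0)) for _ in range(1000)]

# Reduce 1000 points to 50 centroids.
reduced = kmeans(data, k=50)

for dim in range(2):
    print("feature", dim, "original:", five_number_summary([p[dim] for p in data]))
    print("feature", dim, "reduced: ", five_number_summary([c[dim] for c in reduced]))
```

How closely the centroid summary tracks the original depends on the choice of k and on the shape of the data; the centroids here are unweighted, so a cluster of 100 points and a cluster of 5 points each contribute one value to the reduced box plot.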

References

  • [1] Friendly, M. (2008). A brief history of data visualization. In Handbook of data visualization (pp. 15-56). Springer, Berlin, Heidelberg.
  • [2] Keim, D., Qu, H., & Ma, K. L. (2013). Big-data visualization. IEEE Computer Graphics and Applications, 33(4), 20-21.
  • [3] Andrienko, G., Andrienko, N., Drucker, S., Fekete, J. D., Fisher, D., Idreos, S., ... & Sharaf, M. (2020, March). Big data visualization and analytics: Future research challenges and emerging applications. In BigVis 2020-3rd International Workshop on Big Data Visual Exploration and Analytics.
  • [4] Agrawal, R., Kadadi, A., Dai, X., & Andres, F. (2015). Challenges and opportunities with big data visualization. In Proceedings of the 7th International Conference on Management of computational and collective intElligence in Digital EcoSystems (pp. 169-173).
  • [5] Ali, S. M., Gupta, N., Nayak, G. K., & Lenka, R. K. (2016). Big data visualization: Tools and challenges. In 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I) (pp. 656-660). IEEE.
  • [6] Likas, A., Vlassis, N., & Verbeek, J. J. (2003). The global k-means clustering algorithm. Pattern recognition, 36(2), 451-461.
  • [7] Dokeroglu, T., Deniz, A., & Kiziloz, H. E. (2022). A Comprehensive Survey on Recent Metaheuristics for Feature Selection. Neurocomputing.
  • [8] Click-Through Rate (CTR), https://www.kaggle.com/datasets/louischen7/2020-digix-advertisement-ctr-prediction, 2022.

There are 8 citations in total.

Details

Primary Language English
Subjects Computer Software
Journal Section Research Article
Authors

Hakan Akyol 0000-0002-5695-8790

Hale Sema Kızılduman 0000-0002-5695-8790

Tansel Dökeroğlu 0000-0003-1665-5928

Publication Date July 31, 2022
Published in Issue Year 2022 Volume: 02 Issue: 01

Cite

IEEE H. Akyol, H. S. Kızılduman, and T. Dökeroğlu, “Big data reduction and visualization using the K-means algorithm”, Researcher, vol. 02, no. 01, pp. 40–45, 2022, doi: 10.55185/researcher.1135824.

The journal "Researcher: Social Sciences Studies" (RSSS), which began publication in 2013, has continued under the name "Researcher" since August 2020, under Ankara Bilim University.
It is an internationally indexed, nationally refereed, scientific electronic journal that, from 2021 onward, publishes original research articles contributing to the fields of Engineering and Science.