Research Article

INFRASTRUCTURE WITH R PACKAGE FOR ANOMALY DETECTION IN REAL TIME BIG LOG DATA

Year 2017, Volume: 5 Issue: 1, 181 - 189, 30.06.2017
https://doi.org/10.17261/Pressacademia.2017.588

Abstract

Analyzing and detecting anomalies in huge amounts of data is a big
challenge. On the one hand, we face the problem of storing a large amount of
data; on the other, we must process it and detect anomalies in reasonable or
even real time. Real-time analytics can be defined as the capacity to use all
available enterprise data and sources at the moment they arrive or occur in
the system. In this paper, we present an infrastructure that we have
implemented to analyze data from big log files in real time. We also present
algorithms used for anomaly detection in big data; the algorithms are
implemented in the R language. The main components of the infrastructure are
Redis, Logstash, Elasticsearch, the elastic-R client and Kibana. We explore
the implementation of several filters that post-process the log information
and produce various statistics suited to our needs in analyzing log files
containing SQL queries from a big national education system. The
post-processing of the SQL queries focuses mainly on preparing the log
information in an adequate format and on information extraction. The paper
also compares the anomaly detection algorithms to determine which of them
best suits our needs. We add the elastic-R client to the infrastructure we
develop for big data analytics in order to detect anomalies. The purpose of
the analysis is to monitor performance and detect anomalies in order to
prevent possible problems in real time.
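
The abstract does not say which detectors are compared, but the reference list mentions a double-MAD outlier rule. The sketch below illustrates that general idea on a handful of made-up query durations. It is written in Python for illustration only and is not the authors' R implementation. The function name, the sample values, the threshold of 3 and the choice to use unscaled MADs are all assumptions.

```python
from statistics import median


def double_mad_outliers(values, threshold=3.0):
    """Return indices of values flagged by a double-MAD rule.

    The median absolute deviation (MAD) is computed separately for the
    points at or below the median and for the points at or above it, so
    skewed data such as query durations get a separate scale per side.
    Assumption: the MADs are left unscaled and the threshold is 3.
    """
    values = list(values)
    if not values:
        return []
    m = median(values)
    abs_dev = [abs(v - m) for v in values]
    left_mad = median([d for v, d in zip(values, abs_dev) if v <= m])
    right_mad = median([d for v, d in zip(values, abs_dev) if v >= m])

    outliers = []
    for i, (v, d) in enumerate(zip(values, abs_dev)):
        if v == m:
            continue  # a point at the median has distance 0
        mad = left_mad if v < m else right_mad
        # A zero MAD on one side means any deviation there is infinitely far.
        distance = d / mad if mad > 0 else float("inf")
        if distance > threshold:
            outliers.append(i)
    return outliers


# Hypothetical SQL query durations in milliseconds; the last one is slow.
durations_ms = [12, 14, 13, 15, 11, 13, 14, 12, 250]
slow_queries = double_mad_outliers(durations_ms)
```

For these sample durations only the last entry is flagged, because its deviation from the median is far larger than the typical deviation on the upper side.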



 

References

  • pgBadger. Retrieved April 04, 2015, from http://sourceforge.net/projects/pgbadger/.
  • Ian Delahorne. Postgresql Metrics With Logstash. Retrieved April 04, 2015, from http://ian.delahorne.com/blog/2014/06/10/postgresqlmetrics-pipeline
  • Logstash. Retrieved April 05, 2015, from http://logstash.net/docs/1.4.2/filters/metrics.
  • James Turnbull. The Logstash Book Log management made easy. January 26, 2014.
  • Radu Gheorghe and Matthew Lee Hinman. Elasticsearch in action. Manning Publications 2014.
  • Mitchell Anicas. How To Use Logstash and Kibana To Centralize Logs On Ubuntu 14.04. Retrieved April 06, 2015, from https://www.digitalocean.com/community/tutorials/how-to-use-logstash-and-kibana-to-centralize-and-visualize-logs-on-ubuntu-14-04.
  • Zirije Hasani, Margita Kon-Popovska, Goran Velinov. Survey of Technologies for Real Time Big Data Streams Analytic. 11th International Conference on Informatics and Information Technologies. April 11-13, 2014 – Bitola, Macedonia.
  • Zirije Hasani, Margita Kon-Popovska, Goran Velinov. Lambda Architecture for Real Time Big Data Analytic. ICT Innovations 2014 Web Proceedings ISSN 1857-7288
  • Zirije Hasani. Performance comparison throw running job in Hadoop by defining the number of maps and reduces. 12th International Conference on Informatics and Information Technologies 2015. April 24-26, 2015 – Bitola, Macedonia.
  • Zirije Hasani. Virtuoso, System for Saving Semantic Data. 12th International Conference on Informatics and Information Technologies 2015. April 24-26, 2015 – Bitola, Macedonia
  • Apache Lucene. Retrieved April 30, 2015, from https://lucene.apache.org/.
  • Redis. Retrieved April 30, 2015, from http://redis.io/.
  • DBSCAN. Retrieved December 20, 2016, from https://cran.r-project.org/web/packages/dbscan/dbscan.pdf
  • elastic r client. Retrieved November 20 2016, from http://finzi.psych.upenn.edu/library/elastic/html/elastic.html
  • doubleMAD algorithm. Retrieved November 10 2016, from http://eurekastatistics.com/using-the-median-absolute-deviation-to-findoutliers/
  • runMAD algorithm. Retrieved November 10 2016, from http://svitsrv25.epfl.ch/R-doc/library/caTools/html/runmad.html
  • Christophe Leys, Christophe Ley, Olivier Klein, Philippe Bernard and Laurent Licata. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology. Elsevier Inc. 2013.
  • Miller, J. (1991). Reaction time analysis with outlier exclusion: Bias varies with sample size. The Quarterly Journal of Experimental Psychology, 43(4), 907–912, http://dx.doi.org/10.1080/14640749108400962.
  • Zirije Hasani, Boro Jakimovski, Margita Kon-Popovska and Goran Velinov. Real time analytic of SQL queries based on log analytic. ICT Innovations 2015
  • Mark Kasunic, James McCurley, Dennis Goldenson and David Zubrow. An Investigation of Techniques for Detecting Data Anomalies in Earned Value Management Data. Carnegie Mellon University. December 2011 https://pdfs.semanticscholar.org/b998/7cd7e7244b1235a21c72c5a6f6634a9ff430.pdf.

Details

Journal Section Articles
Authors

Zirje Hasani

Publication Date June 30, 2017
Published in Issue Year 2017 Volume: 5 Issue: 1

Cite

APA Hasani, Z. (2017). INFRASTRUCTURE WITH R PACKAGE FOR ANOMALY DETECTION IN REAL TIME BIG LOG DATA. PressAcademia Procedia, 5(1), 181-189. https://doi.org/10.17261/Pressacademia.2017.588
AMA Hasani Z. INFRASTRUCTURE WITH R PACKAGE FOR ANOMALY DETECTION IN REAL TIME BIG LOG DATA. PAP. June 2017;5(1):181-189. doi:10.17261/Pressacademia.2017.588
Chicago Hasani, Zirje. “INFRASTRUCTURE WITH R PACKAGE FOR ANOMALY DETECTION IN REAL TIME BIG LOG DATA”. PressAcademia Procedia 5, no. 1 (June 2017): 181-89. https://doi.org/10.17261/Pressacademia.2017.588.
EndNote Hasani Z (June 1, 2017) INFRASTRUCTURE WITH R PACKAGE FOR ANOMALY DETECTION IN REAL TIME BIG LOG DATA. PressAcademia Procedia 5 1 181–189.
IEEE Z. Hasani, “INFRASTRUCTURE WITH R PACKAGE FOR ANOMALY DETECTION IN REAL TIME BIG LOG DATA”, PAP, vol. 5, no. 1, pp. 181–189, 2017, doi: 10.17261/Pressacademia.2017.588.
ISNAD Hasani, Zirje. “INFRASTRUCTURE WITH R PACKAGE FOR ANOMALY DETECTION IN REAL TIME BIG LOG DATA”. PressAcademia Procedia 5/1 (June 2017), 181-189. https://doi.org/10.17261/Pressacademia.2017.588.
JAMA Hasani Z. INFRASTRUCTURE WITH R PACKAGE FOR ANOMALY DETECTION IN REAL TIME BIG LOG DATA. PAP. 2017;5:181–189.
MLA Hasani, Zirje. “INFRASTRUCTURE WITH R PACKAGE FOR ANOMALY DETECTION IN REAL TIME BIG LOG DATA”. PressAcademia Procedia, vol. 5, no. 1, 2017, pp. 181-9, doi:10.17261/Pressacademia.2017.588.
Vancouver Hasani Z. INFRASTRUCTURE WITH R PACKAGE FOR ANOMALY DETECTION IN REAL TIME BIG LOG DATA. PAP. 2017;5(1):181-9.
