A Comparative Evaluation of the Outlier Detection Methods
Year 2024, 155-159, 15.03.2024
Melis Çelik Güney, Gökhan Tamer Kayaalp
Abstract
In data mining, outliers should be identified and excluded from the data set before analysis so that descriptive statistics and other statistical model parameters can be calculated correctly. This paper compared the performance of model-based, density-based, clustering-based, angle-based, and isolation-based outlier detection methods used in data mining. ROC curves and AUC values were used to compare the methods. A data set that follows a standard normal distribution and fits a logistic regression model was simulated. For the comparison, the data set was modified by randomly adding 30 outliers. The iForest algorithm was found to have higher predictive power than Mahalanobis, LOF, k-means, and ABOD. In addition, outliers in a real data set were detected with the iForest algorithm and deleted. The data sets with and without outliers were then compared. The results showed that the model without outliers had higher predictive ability.
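The evaluation idea in the abstract (simulate standard normal data, inject 30 outliers, score every point, and summarize detection quality with AUC) can be illustrated with a minimal Python sketch. This is not the authors' code. It covers only the Mahalanobis distance, one of the five compared methods, and the sample size of 500, the two dimensions, the outlier-generating mechanism, and the seed are all assumptions chosen for illustration.

```python
import random


def simulate(n_inliers=500, n_outliers=30, seed=0):
    """Return (points, labels); label 1 marks an injected outlier.

    Only "standard normal data" and "30 outliers" come from the abstract;
    every other setting here is an assumption.
    """
    rng = random.Random(seed)
    points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_inliers)]
    labels = [0] * n_inliers
    for _ in range(n_outliers):
        # Assumed mechanism: each coordinate lies 4 to 6 units from zero.
        x = rng.choice((-1, 1)) * rng.uniform(4, 6)
        y = rng.choice((-1, 1)) * rng.uniform(4, 6)
        points.append((x, y))
        labels.append(1)
    return points, labels


def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each 2-D point to the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    scores = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^T S^{-1} d with the closed-form inverse of a 2x2 matrix
        scores.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return scores


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    points, labels = simulate()
    print("Mahalanobis AUC:", round(auc(mahalanobis_sq(points), labels), 3))
```

The other methods named in the abstract could be compared in the same way by swapping `mahalanobis_sq` for another scoring function and reusing `auc`, provided that a larger score always means a more anomalous point.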
Project Number
FDK-2018 10287
Ethical Statement
Ethics committee approval was not required for this study because it did not involve animals or humans. The authors confirm that the ethical policies of the journal, as noted on the journal's author guidelines page, have been adhered to.
Supporting Institution
Cukurova University
Acknowledgments
We gratefully thank Prof. Dr. Zeynel CEBECİ of Cukurova University for his contributions to this study.
We would like to thank the Cukurova University Scientific Research Coordinatorship for supporting this study with project number FDK-2018 10287.
This study was produced from the thesis titled “Comparative Examination of Outlier Detection Methods in Binary Logistics Regression Analysis” at Cukurova University (thesis no: 794371). https://tez.yok.gov.tr/UlusalTezMerkezi/tezSorguSonucYeni.jsp.