Research Article

Sampling Techniques and Application in Machine Learning in order to Analyse Crime Dataset

Year 2022, Issue 38, 296-310, 31.08.2022
https://doi.org/10.31590/ejosat.1115323

Abstract

Machine learning enables machines to learn from data and draw inferences from what they have learned. In this article, five years of crime data were analyzed and models were trained on this data. One-Hot Encoding, Min-Max Normalization, and the Principal Component Analysis algorithm were used to prepare the data. The models were asked to predict whether the offender could be caught, the security level of the area, and the type of crime committed, using the K-Nearest Neighbors, Random Forest, and Extreme Gradient Boosting algorithms. However, on an imbalanced dataset, even an apparently successful model yields misleading results. The main purpose of this article is therefore to transform the imbalanced data into balanced data by various methods and to find the sampling method that is most accurate for the data and most compatible with the classification method. To this end, one statistical sampling method (Stratify), three over-sampling methods (Random Over Sampler, Synthetic Minority Over-sampling, Adaptive Synthetic), three under-sampling methods (Random Under Sampler, Near Miss, Neighborhood Cleaning Rule), and one mixed sampling method (SMOTE Tomek) were applied to counter the imbalance of the data in target fields such as Arrest, Crime Type, and Security. The applied sampling methods produced efficient and effective results.
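The preprocessing-plus-resampling workflow the abstract describes can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's actual pipeline: the column names and class ratio are invented, and a naive random over-sampler stands in for the library implementations of the listed methods.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

rng = np.random.default_rng(0)
n = 600

# Hypothetical stand-in for the crime dataset: one categorical feature,
# two numeric features, and an imbalanced binary target ("Arrest", ~10% positive).
X = pd.DataFrame({
    "district": rng.choice(["north", "south", "west"], size=n),
    "hour": rng.integers(0, 24, size=n),
    "severity": rng.normal(size=n),
})
y = (rng.random(n) < 0.1).astype(int)

# Stratified split preserves the class ratio in both train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def random_over_sample(X, y, rng):
    """Naive random over-sampling: duplicate minority rows at random until
    every class matches the majority count (the simplest balancing method)."""
    classes, counts = np.unique(y, return_counts=True)
    idx = []
    for cls, cnt in zip(classes, counts):
        cls_idx = np.where(y == cls)[0]
        idx.extend(cls_idx)
        idx.extend(rng.choice(cls_idx, size=counts.max() - cnt, replace=True))
    idx = np.asarray(idx)
    return X.iloc[idx], y[idx]

X_bal, y_bal = random_over_sample(X_tr, y_tr, rng)

# One-Hot Encoding + Min-Max Normalization + PCA + KNN, as listed in the abstract.
model = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["district"]),
        ("num", MinMaxScaler(), ["hour", "severity"]),
    ], sparse_threshold=0)),  # force dense output so PCA can consume it
    ("pca", PCA(n_components=3)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
model.fit(X_bal, y_bal)
acc = model.score(X_te, y_te)
print(f"balanced train counts: {np.bincount(y_bal)}, test accuracy: {acc:.2f}")
```

In practice the over-, under-, and mixed-sampling methods the article compares (SMOTE, ADASYN, Near Miss, Neighborhood Cleaning Rule, SMOTE Tomek) are available ready-made in the imbalanced-learn library; the manual sampler above only shows where such a step sits in the pipeline, namely after the train/test split and before model fitting.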

Supporting Institution

-

Project Number

-

Thanks

-

References

  • Hibberts, M., Burke Johnson, R., & Hudson, K. (2012). Common survey sampling techniques. In Handbook of survey methodology for the social sciences (pp. 53-74). Springer, New York, NY.
  • Zhihao, P., Fenglong, Y., & Xucheng, L. (2019, April). Comparison of the different sampling techniques for imbalanced classification problems in machine learning. In 2019 11th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA) (pp. 431-434). IEEE.
  • Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD explorations newsletter, 6(1), 20-29.
  • Sathyadevan, S., Devan, M. S., & Gangadharan, S. S. (2014, August). Crime analysis and prediction using data mining. In 2014 First international conference on networks & soft computing (ICNSC2014) (pp. 406-412). IEEE.
  • Junsomboon, N., & Phienthrakul, T. (2017, February). Combining over-sampling and under-sampling techniques for imbalance dataset. In Proceedings of the 9th International Conference on Machine Learning and Computing (pp. 243-247).
  • Prabakaran, S., & Mitra, S. (2018, April). Survey of analysis of crime detection techniques using data mining and machine learning. In Journal of Physics: Conference Series (Vol. 1000, No. 1, p. 012046). IOP Publishing.
  • Xie, C., Du, R., Ho, J. W., Pang, H. H., Chiu, K. W., Lee, E. Y., & Vardhanabhuti, V. (2020). Effect of machine learning re-sampling techniques for imbalanced datasets in 18F-FDG PET-based radiomics model on prognostication performance in cohorts of head and neck cancer patients. European journal of nuclear medicine and molecular imaging, 47(12), 2826-2835.
  • Etikan, I., & Bala, K. (2017). Sampling and sampling methods. Biometrics & Biostatistics International Journal, 5(6), 00149.
  • Albahli, S., Alsaqabi, A., Aldhubayi, F., Rauf, H. T., Arif, M., & Mohammed, M. A. (2021). Predicting the type of crime: Intelligence gathering and crime analysis. Computers, Materials & Continua, 66(3), 2317-2341.
  • Kurin, S., Steinshamn, S. I., & Saerens, M. (2017). A comparison of classification models for imbalanced datasets.
  • Meng, X. (2013, May). Scalable simple random sampling and stratified sampling. In International Conference on Machine Learning (pp. 531-539). PMLR.
  • Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
  • Fernández, A., Garcia, S., Herrera, F., & Chawla, N. V. (2018). SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of artificial intelligence research, 61, 863-905.
  • Mani, I., & Zhang, I. (2003, August). kNN approach to unbalanced data distributions: a case study involving information extraction. In Proceedings of workshop on learning from imbalanced datasets (Vol. 126, pp. 1-7). ICML.
  • Durahim, A. O. (2016). Comparison of sampling techniques for imbalanced learning. Yönetim Bilişim Sistemleri Dergisi, 2(2), 181-191.
  • Pandey, A., & Jain, A. (2017). Comparative analysis of KNN algorithm using various normalization techniques. International Journal of Computer Network and Information Security, 9(11), 36.
  • Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).
  • Zeng, G. (2020). On the confusion matrix in credit scoring and its analytical properties. Communications in Statistics-Theory and Methods, 49(9), 2080-2093.
  • Dalianis, H. (2018). Evaluation metrics and evaluation. In Clinical text mining (pp. 45-53). Springer, Cham.
  • Laurikkala, J. (2001, July). Improving identification of difficult small classes by balancing class distribution. In Conference on artificial intelligence in medicine in Europe (pp. 63-66). Springer, Berlin, Heidelberg.
  • He, H., Bai, Y., Garcia, E. A., & Li, S. (2008, June). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence) (pp. 1322-1328). IEEE.
  • Tomek, I. (1976). Two modifications of CNN. IEEE Trans. Systems, Man and Cybernetics, 6, 769-772.
  • Batista, G. E., Bazzan, A. L., & Monard, M. C. (2003, December). Balancing Training Data for Automated Annotation of Keywords: a Case Study. In WOB (pp. 10-18).
  • Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1-3), 37-52.
  • Wu, W., Mallet, Y., Walczak, B., Penninckx, W., Massart, D. L., Heuerding, S., & Erni, F. (1996). Comparison of regularized discriminant analysis linear discriminant analysis and quadratic discriminant analysis applied to NIR data. Analytica Chimica Acta, 329(3), 257-265.


Details

Primary Language English
Subjects Engineering
Journal Section Articles
Authors

Ayla Saylı 0000-0003-0409-537X

Sevil Başarır 0000-0002-1599-0727

Project Number -
Publication Date August 31, 2022
Published in Issue Year 2022

Cite

APA Saylı, A., & Başarır, S. (2022). Sampling Techniques and Application in Machine Learning in order to Analyse Crime Dataset. Avrupa Bilim Ve Teknoloji Dergisi(38), 296-310. https://doi.org/10.31590/ejosat.1115323