TY - JOUR
T1 - Medikal Sentetik Veri Üretimiyle Veri Dengelemesi
TT - Data Balancing with Synthetic Medical Data Generation
AU - Esen, M. Fevzi
AU - Deveci, Ahmet
PY - 2022
DA - June
DO - 10.52693/jsas.1105599
JF - Journal of Statistics and Applied Sciences
JO - JSAS
PB - Abdulkadir KESKİN
WT - DergiPark
SN - 2718-0999
SP - 17
EP - 27
IS - 5
LA - tr
AB - In fields that require the use of health data, such as health services planning, clinical trials, and research and development, obtaining and using personal health data involves ethical, bureaucratic and operational difficulties. Restrictions on the security of electronic personal health records and on personal data privacy, together with the cost and time required to obtain data from clinical and field studies, make it necessary to generate artificial data that is as close as possible to real data. In light of the recently growing need for data in the health field, this study discusses the importance of synthetic data and aims to compare the performance of the SMOTE, SMOTEENN, BorderlineSMOTE, SMOTETomek and ADASYN methods used in synthetic data generation. Two publicly available datasets that differ in number of observations and classes were used: one with 15 variables on 390 patients, and one with 16 variables on 19,212 COVID-19 patients. The study concludes that SMOTE was more successful in balancing the dataset with the larger number of observations and classes, and that it can be used effectively for synthetic data generation compared with hybrid techniques.
KW - synthetic data
KW - SMOTE
KW - SMOTEENN
KW - health informatics
N2 - Obtaining and using personal health data raises ethical, bureaucratic and operational difficulties in areas that require sensitive health data, such as health care planning, clinical trials, and research and development studies.
The cost and time required to obtain data from clinical and field studies, and especially the restrictions on the security of electronic personal health records and on personal data privacy, make it necessary to produce synthetic data that is as close as possible to real data. This study considers the importance of synthetic data generation in light of the increasing need for data in the health field, and aims to compare the performance of the SMOTE, SMOTEENN, BorderlineSMOTE, SMOTETomek and ADASYN methods used in synthetic data generation. Two publicly available datasets that differ in number of observations and classes were used: one consisting of 15 variables on 390 patients, and one consisting of 16 variables on 19,212 COVID-19 patients. The study concludes that SMOTE is more successful in balancing the dataset with the larger number of observations and classes, and that this technique can be used effectively in synthetic data generation compared to hybrid techniques.
CR - [1] ReportLinker (2021). Big Data Industry. https://www.reportlinker.com/market-report/Advanced-IT/513221/Big-Data, 20.07.2021
CR - [2] Gartner (2021). Top Strategic Technology Trends for 2021. https://www.gartner.com/en/publications/top-tech-trends-2021, 13.07.2021
CR - [3] Jacob, P.D. (2020). Management of patient healthcare information: Healthcare-related information flow, access, and availability. In Fundamentals of Telemedicine and Telehealth (pp. 35-57) (Ed. Shashi Gogia), Academic Press.
CR - [4] Goncalves, A., Ray, P., Soper, B., Stevens, J., Coyle, L., & Sales, A. P. (2020). Generation and evaluation of synthetic patient data. BMC Medical Research Methodology, 20(1), 1–40. https://doi.org/10.1186/s12874-020-00977-1
CR - [5] Yale, A., Dash, S., Dutta, R., Guyon, I., Pavao, A., & Bennett, K. P. (2020). Generation and evaluation of privacy preserving synthetic health data. Neurocomputing, 416, 244–255.
https://doi.org/10.1016/j.neucom.2019.12.136
CR - [6] Rocher, L., Hendrickx, J. M., & de Montjoye, Y.-A. (2019). Estimating the success of re-identifications in incomplete datasets using generative models. Nature Communications, 10, 3069.
CR - [7] Tucker, A., Wang, Z., Rotalinti, Y., & Myles, P. (2020). Generating high-fidelity synthetic patient data for assessing machine learning healthcare software. npj Digital Medicine, 3(1). https://doi.org/10.1038/s41746-020-00353-9
CR - [8] Walonoski, J., Klaus, S., Granger, E., Hall, D., Gregorowicz, A., Neyarapally, G., Watson, A., & Eastman, J. (2020). Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine, 1–2, 100007. https://doi.org/10.1016/j.ibmed.2020.100007
CR - [9] Dube, K., & Gallagher, T. (2014). Approach and Method for Generating Realistic Synthetic Electronic Healthcare Records for Secondary Use. In: Gibbons J., MacCaull W. (Eds.), Foundations of Health Information Engineering and Systems. FHIES 2013. Lecture Notes in Computer Science, vol 8315. Berlin, Heidelberg: Springer.
CR - [10] Buczak, A. L., Babin, S., & Moniz, L. (2010). Data-driven approach for creating synthetic electronic medical records. BMC Medical Informatics and Decision Making, 10, 59. https://doi.org/10.1186/1472-6947-10-59
CR - [11] Zeng, M., Zou, B., Wei, F., Liu, X., & Wang, L. (2016). Effective prediction of three common diseases by combining SMOTE with Tomek links technique for imbalanced medical data. Proceedings of 2016 IEEE International Conference of Online Analysis and Computing Science, ICOACS 2016, 2016, 225–228. https://doi.org/10.1109/ICOACS.2016.7563084
CR - [12] Liu, N., Li, X., Qi, E., Xu, M., Li, L., & Gao, B. (2020). A novel ensemble learning paradigm for medical diagnosis with imbalanced data. IEEE Access, 8, 171263–171280. https://doi.org/10.1109/ACCESS.2020.3014362
CR - [13] Liu, Y., Li, X., Chen, X., Wang, X., & Li, H. (2020).
High-Performance Machine Learning for Large-Scale Data Classification considering Class Imbalance. Scientific Programming, 2020. https://doi.org/10.1155/2020/1953461
CR - [14] Gartner (2020). Hype Cycle for Data Science and Machine Learning, 2020. https://www.gartner.com/en/documents/3988118/hype-cycle-for-data-science-and-machine-learning-2020, 19.07.2021
CR - [15] Ayala-Rivera, V., Portillo-Dominguez, A. O., Murphy, L., & Thorpe, C. (2016). COCOA: A synthetic data generator for testing anonymization techniques. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9867 LNCS, 163–177. https://doi.org/10.1007/978-3-319-45381-1_13
CR - [16] Marathe, M. V. (2006). Synthetic Data for Data Mining to Support Epidemiological Modeling. Network Dynamics and Simulation Science Laboratory, Virginia Tech. Retrieved August 1, 2021, from https://www.cs.dartmouth.edu/~cbk/sdm06/marathe-data.sdm.pdf
CR - [17] Emmert-Streib, F., Yang, Z., Feng, H., Tripathi, S., & Dehmer, M. (2020). An Introductory Review of Deep Learning for Prediction Models With Big Data. Frontiers in Artificial Intelligence, 3, 4. https://doi.org/10.3389/frai.2020.00004
CR - [18] Bekkar, M., & Alitouche, T. A. (2013). Imbalanced Data Learning Approaches Review. International Journal of Data Mining & Knowledge Management Process, 3(4). https://doi.org/10.5121/ijdkp.2013.3402
CR - [19] Murray, R. E., Ryan, P. B., & Reisinger, S. J. (2011). Design and validation of a data simulation model for longitudinal healthcare data. AMIA ... Annual Symposium proceedings. AMIA Symposium, 2011, 1176–1185.
CR - [20] Rahman, M. M., & Davis, D. N. (2013). Addressing the Class Imbalance Problem in Medical Datasets. International Journal of Machine Learning and Computing, May 2014, 224–228. https://doi.org/10.7763/ijmlc.2013.v3.307
CR - [21] Riegler, G., Urschler, M., Ruther, M., Bischof, H., & Stern, D. (2015).
Anatomical Landmark Detection in Medical Applications Driven by Synthetic Data. Proceedings of the IEEE International Conference on Computer Vision, 2015-February, 85–89. https://doi.org/10.1109/ICCVW.2015.21
CR - [22] Belarouci, S., & Chikh, M. A. (2017). Medical imbalanced data classification. Advances in Science, Technology and Engineering Systems, 2(3), 116–124. https://doi.org/10.25046/aj020316
CR - [23] Shamsuddin, R., Maweu, B. M., Li, M., & Prabhakaran, B. (2018). Virtual patient model: An approach for generating synthetic healthcare time series data. Proceedings - 2018 IEEE International Conference on Healthcare Informatics, ICHI 2018, February 2019, 208–218. https://doi.org/10.1109/ICHI.2018.00031
CR - [24] Frid-Adar, M., Diamant, I., Klang, E., Amitai, M., Goldberger, J., & Greenspan, H. (2018). GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing, 321, 321–331. https://doi.org/10.1016/j.neucom.2018.09.013
CR - [25] Zhang, Z., Yan, C., Mesa, D. A., Sun, J., & Malin, B. A. (2020). Ensuring electronic medical record simulation through better training, modeling, and evaluation. Journal of the American Medical Informatics Association, 27(1). https://doi.org/10.1093/jamia/ocz161
CR - [26] Benaim, A. R., Almog, R., Gorelik, Y., Hochberg, I., Nassar, L., Mashiach, T., Khamaisi, M., Lurie, Y., Azzam, Z. S., Khoury, J., Kurnik, D., & Beyar, R. (2020). Analyzing medical research results based on synthetic data and their relation to real data results: Systematic comparison from five observational studies. JMIR Medical Informatics, 8(2), 1–14. https://doi.org/10.2196/16492
CR - [27] Gherardini, M., Mazomenos, E., Menciassi, A., & Stoyanov, D. (2020). Catheter segmentation in X-ray fluoroscopy using synthetic data and transfer learning with light U-nets. Computer Methods and Programs in Biomedicine, 192, 105420.
https://doi.org/10.1016/j.cmpb.2020.105420
CR - [28] Hernandez-Matamoros, A., Fujita, H., & Perez-Meana, H. (2020). A novel approach to create synthetic biomedical signals using BiRNN. Information Sciences, 541, 218–241. https://doi.org/10.1016/j.ins.2020.06.019
CR - [29] Shi, G., Wang, J., Qiang, Y., Yang, X., Zhao, J., Hao, R., Yang, W., Du, Q., & Kazihise, N. G. F. (2020). Knowledge-guided synthetic medical image adversarial augmentation for ultrasonography thyroid nodule classification. Computer Methods and Programs in Biomedicine, 196, 105611. https://doi.org/10.1016/j.cmpb.2020.105611
CR - [30] Stolfi, P., Valentini, I., Palumbo, M. C., Tieri, P., Grignolio, A., & Castiglione, F. (2020). Potential predictors of type-2 diabetes risk: machine learning, synthetic data and wearable health devices. BMC Bioinformatics, 21(17), 1–20. https://doi.org/10.1186/s12859-020-03763-4
CR - [31] Vaden, K. I., Gebregziabher, M., Dyslexia Data Consortium, & Eckert, M. A. (2020). Fully synthetic neuroimaging data for replication and exploration. NeuroImage, 223. https://doi.org/10.1016/j.neuroimage.2020.117284
CR - [32] Vilardell, M., Buxó, M., Clèries, R., Martínez, J. M., Garcia, G., Ameijide, A., Font, R., & Civit, S. (2020). Missing data imputation and synthetic data simulation through modeling graphical probabilistic dependencies between variables (ModGraProDep): An application to breast cancer survival. Artificial Intelligence in Medicine, 107, 101875. https://doi.org/10.1016/j.artmed.2020.101875
CR - [33] Waheed, A., Goyal, M., Gupta, D., Khanna, A., Al-Turjman, F., & Pinheiro, P. R. (2020). CovidGAN: Data Augmentation Using Auxiliary Classifier GAN for Improved Covid-19 Detection. IEEE Access, 8, 91916–91923. https://doi.org/10.1109/ACCESS.2020.2994762
CR - [34] Dai, F., Song, Y., Si, W., Yang, G., Hu, J., & Wang, X. (2021). Improved CBSO: A distributed fuzzy-based adaptive synthetic oversampling algorithm for imbalanced judicial data. Information Sciences, 569, 70–89.
https://doi.org/10.1016/j.ins.2021.04.017
CR - [35] Karbhari, Y., Basu, A., Geem, Z. W., Han, G. T., & Sarkar, R. (2021). Generation of synthetic chest X-ray images and detection of COVID-19: A deep learning based approach. Diagnostics, 11(5), 1–19. https://doi.org/10.3390/diagnostics11050895
CR - [36] Palmér, E., Karlsson, A., Nordström, F., Petruson, K., Siversson, C., Ljungberg, M., & Sohlin, M. (2021). Synthetic computed tomography data allows for accurate absorbed dose calculations in a magnetic resonance imaging only workflow for head and neck radiotherapy. Physics and Imaging in Radiation Oncology, 17(December 2020), 36–42. https://doi.org/10.1016/j.phro.2020.12.007
CR - [37] Vepa, A., Saleem, A., Rakhshan, K., Daneshkhah, A., Sedighi, T., Shohaimi, S., Omar, A., Salari, N., Chatrabgoun, O., Dharmaraj, D., Sami, J., Parekh, S., Ibrahim, M., Raza, M., Kapila, P., & Chakrabarti, P. (2021). Using machine learning algorithms to develop a clinical decision-making tool for covid-19 inpatients. International Journal of Environmental Research and Public Health, 18(12), 1–22. https://doi.org/10.3390/ijerph18126228
CR - [38] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357. https://arxiv.org/pdf/1106.1813.pdf
CR - [39] Tanha, J., Abdi, Y., Samadi, N., Razzaghi, N., & Asadpour, M. (2020). Boosting methods for multi-class imbalanced data classification: an experimental review. Journal of Big Data, 7, 70.
CR - [40] Susan, S., & Kumar, A. (2021). The balancing trick: Optimized sampling of imbalanced datasets—A brief survey of the recent State of the Art. Engineering Reports, 3, e12298. https://doi.org/10.1002/eng2.12298
UR - https://doi.org/10.52693/jsas.1105599
L1 - https://dergipark.org.tr/tr/download/article-file/2382870
ER -