Medikal Sentetik Veri Üretimiyle Veri Dengelemesi

Ahmet Deveci; M. Fevzi Esen

doi:10.52693/jsas.1105599

Data Balancing with Synthetic Medical Data Generation

Abstract

Fields that depend on sensitive health data, such as health care planning, clinical trials, and research and development, face ethical, bureaucratic, and operational difficulties in obtaining and using personal health data. Restrictions on the security of electronic personal health records and on personal data privacy, together with the cost and time required to collect data from clinical and field studies, make it necessary to generate synthetic data that is as close as possible to real data. In line with the growing need for data in the health field, this study considers the importance of synthetic data and aims to compare the performance of the SMOTE, SMOTEENN, BorderlineSMOTE, SMOTETomek, and ADASYN methods used in synthetic data generation. Two publicly available datasets that differ in their number of observations and classes were used: one with 15 variables for 390 patients, and one with 16 variables for 19,212 COVID-19 patients. The study concludes that SMOTE was more successful in balancing the dataset with the larger number of observations and classes, and that it can be used effectively for synthetic data generation in comparison with hybrid techniques.
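The abstract does not describe how the compared methods were implemented. As a rough illustration of the core idea behind SMOTE only, the sketch below creates each synthetic minority sample by interpolating between a minority sample and one of its nearest minority-class neighbors. The function names `smote_sketch` and `balance`, the parameter defaults, and the toy data are assumptions made for this example; they are not taken from the study.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest neighbors at random.
        neighbor = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


def balance(majority, minority, k=3, seed=0):
    """Oversample the minority class until both classes have the same size."""
    n_new = max(0, len(majority) - len(minority))
    return list(minority) + smote_sketch(minority, n_new, k=k, seed=seed)


# Hypothetical toy data (not from the study): two numeric features per record.
majority = [(float(i), float(i % 5)) for i in range(20)]
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.6), (1.8, 1.4)]

balanced_minority = balance(majority, minority)
print(len(majority), len(balanced_minority))  # prints "20 20"
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies. The other methods named in the abstract build on this step: SMOTEENN and SMOTETomek follow the oversampling with a cleaning pass, while BorderlineSMOTE and ADASYN change which minority samples are used as base points.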

Details

Primary Language

Turkish

Subjects

-

Section

Research Article

Authors

Ahmet Deveci
0000-0002-3044-8397
Türkiye

M. Fevzi Esen *
0000-0001-7823-0883
Türkiye

Publication Date

30 June 2022

Submission Date

20 April 2022

Acceptance Date

26 June 2022

Published in Issue

Year 2022, Issue: 5

DOI

https://doi.org/10.52693/jsas.1105599

IZ

https://izlik.org/JA23LJ66BL

Cite

APA

Deveci, A., & Esen, M. F. (2022). Medikal Sentetik Veri Üretimiyle Veri Dengelemesi. Journal of Statistics and Applied Sciences, 5, 17-27. https://doi.org/10.52693/jsas.1105599

AMA

1. Deveci A, Esen MF. Medikal Sentetik Veri Üretimiyle Veri Dengelemesi. JSAS. 2022;(5):17-27. doi:10.52693/jsas.1105599

Chicago

Deveci, Ahmet, and M. Fevzi Esen. 2022. “Medikal Sentetik Veri Üretimiyle Veri Dengelemesi”. Journal of Statistics and Applied Sciences, no. 5: 17-27. https://doi.org/10.52693/jsas.1105599.

EndNote

Deveci A, Esen MF (1 June 2022) Medikal Sentetik Veri Üretimiyle Veri Dengelemesi. Journal of Statistics and Applied Sciences 5 17–27.

IEEE

[1] A. Deveci and M. F. Esen, “Medikal Sentetik Veri Üretimiyle Veri Dengelemesi”, JSAS, no. 5, pp. 17–27, Jun. 2022, doi: 10.52693/jsas.1105599.

ISNAD

Deveci, Ahmet - Esen, M. Fevzi. “Medikal Sentetik Veri Üretimiyle Veri Dengelemesi”. Journal of Statistics and Applied Sciences. 5 (1 June 2022): 17-27. https://doi.org/10.52693/jsas.1105599.

JAMA

1. Deveci A, Esen MF. Medikal Sentetik Veri Üretimiyle Veri Dengelemesi. JSAS. 2022;(5):17–27.

MLA

Deveci, Ahmet, and M. Fevzi Esen. “Medikal Sentetik Veri Üretimiyle Veri Dengelemesi”. Journal of Statistics and Applied Sciences, no. 5, June 2022, pp. 17-27, doi:10.52693/jsas.1105599.

Vancouver

1. Ahmet Deveci, M. Fevzi Esen. Medikal Sentetik Veri Üretimiyle Veri Dengelemesi. JSAS. 1 June 2022;(5):17-27. doi:10.52693/jsas.1105599