Thyroid Cancer Recurrence Prediction: A Comparison of Synthetic and Real Data
Year 2026, Volume: 9, Issue: 2, pp. 608–615, 15.03.2026
Emrullah Gazioğlu
Abstract
Data scarcity and privacy constraints hinder the development of artificial intelligence (machine learning and deep learning) models in medical research and many other fields. This study demonstrates how synthetic data generation can be used to augment small datasets. The Differentiated Thyroid Cancer Recurrence dataset from the UCI Machine Learning Repository was used. Large-scale synthetic data that preserves the statistical characteristics of the original data was generated, and the class imbalance was reduced. Performance was evaluated with the XGBoost, LightGBM, k-Nearest Neighbors (kNN), and Decision Tree (DT) algorithms. Models trained on synthetic data performed comparably to or better than models trained on the original data. The ensemble methods performed consistently, while the simple models improved significantly. Stability analysis showed that XGBoost and LightGBM were the most consistent models.
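The comparison summarized in the abstract can be illustrated with a minimal sketch. It is not the study's pipeline: the toy two-feature data, the per-class Gaussian sampler (a simple stand-in for the DataXID generator), the plain kNN classifier, and every function name and parameter value below are assumptions made for illustration. Scoring both models on held-out real rows is also an assumption, since the abstract does not describe the evaluation split. The sketch trains the same classifier once on real rows and once on balanced synthetic rows, then reports the mean and standard deviation of accuracy across seeds as a simple stability measure.

```python
import random
import statistics


def make_real_data(n, minority_share, rng):
    """Toy stand-in for a small, imbalanced tabular dataset (hypothetical)."""
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < minority_share else 0
        center = 2.0 if label == 1 else 0.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows


def sample_synthetic(real_rows, n_per_class, rng):
    """Draw a balanced synthetic set from per-class Gaussians fitted to real rows.

    This naive sampler only illustrates the idea of a generator that keeps
    per-class feature statistics; it is not the generator used in the study.
    """
    synthetic = []
    for label in (0, 1):
        columns = list(zip(*[x for x, y in real_rows if y == label]))
        means = [statistics.mean(c) for c in columns]
        stdevs = [statistics.stdev(c) for c in columns]
        for _ in range(n_per_class):
            features = [rng.gauss(m, s) for m, s in zip(means, stdevs)]
            synthetic.append((features, label))
    return synthetic


def knn_predict(train_rows, x, k=5):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    nearest = sorted(
        train_rows,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )[:k]
    votes = sum(y for _, y in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train_rows, test_rows, k=5):
    """Share of test rows whose label is predicted correctly."""
    hits = sum(knn_predict(train_rows, x, k) == y for x, y in test_rows)
    return hits / len(test_rows)


def compare(seeds=(0, 1, 2, 3, 4)):
    """Return per-seed accuracies for real-trained and synthetic-trained kNN."""
    real_scores, synth_scores = [], []
    for seed in seeds:
        rng = random.Random(seed)
        real = make_real_data(300, 0.3, rng)
        train, test = real[:200], real[200:]  # test rows are always real
        synthetic = sample_synthetic(train, 500, rng)
        real_scores.append(accuracy(train, test))
        synth_scores.append(accuracy(synthetic, test))
    return real_scores, synth_scores


if __name__ == "__main__":
    real_scores, synth_scores = compare()
    for name, scores in (("real", real_scores), ("synthetic", synth_scores)):
        print(name, round(statistics.mean(scores), 3), round(statistics.stdev(scores), 3))
```

A lower standard deviation across seeds indicates a more stable model under this setup. Library implementations of the four algorithms named in the abstract could replace `knn_predict` without changing the comparison loop.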
Ethical Statement
Since the dataset used in this study consists of anonymized patient records that are publicly available through the UCI Machine Learning Repository, ethics committee approval is not required.
Thanks
We thank the researchers who publicly provided the Differentiated Thyroid Cancer Recurrence dataset used in this study.
Data Availability
The original thyroid cancer dataset is publicly available in the UCI Machine Learning Repository (https://archive.ics.uci.edu). The synthetic data was generated with the DataXID platform and is not publicly available for privacy reasons. The model training code is available from the corresponding author upon reasonable request.