Research Article

Synthetic Data Generation with Modified Artificial Bee Colony Optimization Algorithm and Statistical Modeling

Year 2024, Volume: 14 Issue: 4, 1408 - 1431, 01.12.2024
https://doi.org/10.21597/jist.1495455

Abstract

Machine learning is a powerful decision support tool for analyzing and evaluating real-life data. It aims to produce new solutions and improve performance, which ties it closely to data science, and data form the basis of that relationship. How effectively meaningful insights can be drawn from data depends on the quality of the model's training. To improve that quality, both the variety of combinations among the data and the total number of records in the dataset should be increased. In practice, however, insufficient data access, legal regulations, ethical rules, confidentiality procedures, privacy concerns, data-sharing restrictions, and cost stand in the way. Synthetic data generation is a fundamental step in data science for overcoming these obstacles, improving functionality, and enabling strong machine-learning inferences. This study therefore proposes a new synthetic data generation approach consisting of three basic stages. In the first stage, synthetic data whose distribution is similar to that of the original data are produced with a modified ABC (Artificial Bee Colony) optimization algorithm. In the second stage, the category information of the independent variables in the generated data is determined through a statistical evaluation based on regression methods. In the third stage, the efficiency and applicability of the generated data are evaluated with supervised machine learning classifiers. The evaluation showed that the proposed approach improves the performance of machine learning classifiers in proportion to the increasing number of data. The decision tree algorithm, which performed best, achieved success rates of 100%, 92.5%, 100%, 85%, and 66% on five separate enriched datasets, respectively.
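The first stage can be illustrated with a minimal sketch of a standard Artificial Bee Colony loop (employed, onlooker, and scout phases). The abstract does not describe the paper's modifications, so this is not the author's algorithm. The objective used here (the gap between per-column mean and standard deviation of the candidate and the original table), the function names, the parameter defaults, and the toy data are all assumptions made for illustration. The regression-based labelling and the classifier evaluation of stages two and three are not shown.

```python
import random
import statistics


def column_stats(rows):
    """Per-column mean and population standard deviation of a row-major table."""
    cols = list(zip(*rows))
    return [statistics.mean(c) for c in cols], [statistics.pstdev(c) for c in cols]


def cost(candidate, target_means, target_stdevs):
    """Distribution gap, lower is better. This objective is an assumption."""
    means, stdevs = column_stats(candidate)
    return sum(abs(m - tm) + abs(s - ts)
               for m, s, tm, ts in zip(means, stdevs, target_means, target_stdevs))


def abc_generate(original, n_rows, n_sources=10, n_iters=200, limit=20, seed=0):
    """Search for an n_rows table whose column statistics resemble `original`."""
    rng = random.Random(seed)
    t_means, t_stdevs = column_stats(original)
    cols = list(zip(*original))
    lows, highs = [min(c) for c in cols], [max(c) for c in cols]

    def random_source():
        # One food source is a whole candidate synthetic table.
        return [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
                for _ in range(n_rows)]

    sources = [random_source() for _ in range(n_sources)]
    costs = [cost(s, t_means, t_stdevs) for s in sources]
    trials = [0] * n_sources

    def try_improve(k):
        # Classic ABC move on one cell: v = x + phi * (x - x_partner).
        partner = rng.choice([i for i in range(n_sources) if i != k])
        i, j = rng.randrange(n_rows), rng.randrange(len(lows))
        x = sources[k][i][j]
        v = x + rng.uniform(-1, 1) * (x - sources[partner][i][j])
        candidate = [row[:] for row in sources[k]]
        candidate[i][j] = min(max(v, lows[j]), highs[j])  # clamp to observed range
        c = cost(candidate, t_means, t_stdevs)
        if c < costs[k]:  # greedy selection
            sources[k], costs[k], trials[k] = candidate, c, 0
        else:
            trials[k] += 1

    best_k = min(range(n_sources), key=costs.__getitem__)
    best, best_cost = sources[best_k], costs[best_k]
    for _ in range(n_iters):
        for k in range(n_sources):  # employed bees: one move per source
            try_improve(k)
        weights = [1.0 / (1.0 + c) for c in costs]
        for k in rng.choices(range(n_sources), weights=weights, k=n_sources):
            try_improve(k)  # onlooker bees: favour better sources
        best_k = min(range(n_sources), key=costs.__getitem__)
        if costs[best_k] < best_cost:  # remember the best before scouting
            best, best_cost = sources[best_k], costs[best_k]
        for k in range(n_sources):  # scout bees: abandon stale sources
            if trials[k] > limit:
                sources[k] = random_source()
                costs[k] = cost(sources[k], t_means, t_stdevs)
                trials[k] = 0
    return best, best_cost


# Hypothetical toy table (two numeric features); not one of the paper's datasets.
original = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.2], [6.7, 3.1]]
synthetic, gap = abc_generate(original, n_rows=8, seed=1)
```

Matching only the first two moments of each column is a deliberately simple stand-in for "similar to the distribution of the original data". A fuller objective could also compare correlations between columns or category frequencies.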

Turkish title: Değiştirilmiş Yapay Arı Kolonisi Optimizasyon Algoritması ve İstatistiksel Modelleme ile Sentetik Veri Üretimi



Details

Primary Language English
Subjects Software Engineering (Other)
Journal Section Computer Engineering
Authors

Fatma Akalın 0000-0001-6670-915X

Publication Date December 1, 2024
Submission Date June 4, 2024
Acceptance Date September 12, 2024
Published in Issue Year 2024 Volume: 14 Issue: 4

Cite

APA Akalın, F. (2024). Synthetic Data Generation with Modified Artificial Bee Colony Optimization Algorithm and Statistical Modeling. Journal of the Institute of Science and Technology, 14(4), 1408-1431. https://doi.org/10.21597/jist.1495455
AMA Akalın F. Synthetic Data Generation with Modified Artificial Bee Colony Optimization Algorithm and Statistical Modeling. J. Inst. Sci. and Tech. December 2024;14(4):1408-1431. doi:10.21597/jist.1495455
Chicago Akalın, Fatma. “Synthetic Data Generation With Modified Artificial Bee Colony Optimization Algorithm and Statistical Modeling”. Journal of the Institute of Science and Technology 14, no. 4 (December 2024): 1408-31. https://doi.org/10.21597/jist.1495455.
EndNote Akalın F (December 1, 2024) Synthetic Data Generation with Modified Artificial Bee Colony Optimization Algorithm and Statistical Modeling. Journal of the Institute of Science and Technology 14 4 1408–1431.
IEEE F. Akalın, “Synthetic Data Generation with Modified Artificial Bee Colony Optimization Algorithm and Statistical Modeling”, J. Inst. Sci. and Tech., vol. 14, no. 4, pp. 1408–1431, 2024, doi: 10.21597/jist.1495455.
ISNAD Akalın, Fatma. “Synthetic Data Generation With Modified Artificial Bee Colony Optimization Algorithm and Statistical Modeling”. Journal of the Institute of Science and Technology 14/4 (December 2024), 1408-1431. https://doi.org/10.21597/jist.1495455.
JAMA Akalın F. Synthetic Data Generation with Modified Artificial Bee Colony Optimization Algorithm and Statistical Modeling. J. Inst. Sci. and Tech. 2024;14:1408–1431.
MLA Akalın, Fatma. “Synthetic Data Generation With Modified Artificial Bee Colony Optimization Algorithm and Statistical Modeling”. Journal of the Institute of Science and Technology, vol. 14, no. 4, 2024, pp. 1408-31, doi:10.21597/jist.1495455.
Vancouver Akalın F. Synthetic Data Generation with Modified Artificial Bee Colony Optimization Algorithm and Statistical Modeling. J. Inst. Sci. and Tech. 2024;14(4):1408-31.