Research Article

SOLVING MISSING DATA PROBLEMS RELATED TO DATA QUALITY IN WINE PRODUCTION BY DEEP LEARNING: AN APPLICATION WITH GENERATIVE ADVERSIAL NETWORKS

Year 2021, Volume: 5 Issue: 1, 99 - 111, 21.06.2021
https://doi.org/10.30625/ijctr.943818

Abstract

The aim of the study is to select an appropriate method for solving the missing data problems that affect data quality in wine production, and to create a guide that wine-producing businesses can consult when they face such problems. To this end, a missing data problem was deliberately created, in a way that breaks data integrity, on the data set used to classify wines by quality, and the steps needed to solve the problem were analyzed. For the missing data completion task, the study proposes Wasserstein Generative Adversarial Imputation Networks (WGAIN), an improved version of the Generative Adversarial Networks (GAN) algorithm. This architecture builds on the idea of changing the cost function, an approach developed to address problems common in GANs, and generalizes it to handle the specific characteristics of the imputation problem. In the experiment on the real-world data set, WGAIN performed significantly better than the other imputation techniques in terms of Root Mean Square Error (RMSE).
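The evaluation protocol described in the abstract (remove values from a complete data set, impute them, then score the imputation with RMSE) can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic rather than the wine data set, values are removed completely at random (the abstract does not say which missingness mechanism was used), and column-mean imputation stands in as a simple baseline, not WGAIN. All function names are hypothetical.

```python
import math
import random


def mask_completely_at_random(rows, missing_rate, rng):
    """Return (masked_rows, mask); mask[i][j] is True where a value was removed."""
    masked, mask = [], []
    for row in rows:
        flags = [rng.random() < missing_rate for _ in row]
        mask.append(flags)
        masked.append([None if f else v for v, f in zip(row, flags)])
    return masked, mask


def impute_column_means(masked):
    """Fill each missing cell with the mean of the observed values in its column."""
    n_cols = len(masked[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in masked if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in masked]


def rmse_on_missing(truth, imputed, mask):
    """RMSE computed only over the cells that were artificially removed."""
    errors = [
        (t - p) ** 2
        for t_row, p_row, m_row in zip(truth, imputed, mask)
        for t, p, m in zip(t_row, p_row, m_row)
        if m
    ]
    return math.sqrt(sum(errors) / len(errors)) if errors else 0.0


rng = random.Random(0)
# Synthetic stand-in for two numeric measurements (assumption: not the wine data).
data = [[rng.gauss(10.0, 1.0), rng.gauss(3.0, 0.2)] for _ in range(200)]
masked, mask = mask_completely_at_random(data, missing_rate=0.2, rng=rng)
imputed = impute_column_means(masked)
score = rmse_on_missing(data, imputed, mask)
print(f"RMSE of mean imputation on removed cells: {score:.3f}")
```

Scoring only the removed cells keeps the comparison fair, since observed cells are copied through unchanged by any imputer. A lower RMSE means the imputed values are closer to the values that were removed.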


ŞARAP ÜRETİMİNDE VERİ KALİTESİNE İLİŞKİN EKSİK VERİ SORUNLARININ DERİN ÖĞRENME İLE ÇÖZÜLMESİ: ÜRETİCİ ÇEKİŞMECİ AĞLARLA BİR UYGULAMA

Abstract

The aim of the research is to select the appropriate method for solving the missing data problems that affect data quality in wine production, and to create a guide that wine-producing businesses can consult when faced with missing data problems. In line with this aim, a missing data problem was created, in a way that breaks data integrity, on the data set used to classify wines by quality, and the steps required to solve the problem were analyzed. For the missing data completion task, the study proposes the use of Wasserstein Generative Adversarial Imputation Networks (WGAIN), an improved version of the Generative Adversarial Networks (GAN) algorithm, which belongs to the class of generative models. This new architecture was built on the idea of changing the cost function, developed to counter problems frequently seen in GANs, and was generalized so that it can cope with the unique characteristics of the imputation problem. In the experiment conducted with a real-world data set, WGAIN was found to perform significantly better than the other imputation techniques, based on the Root Mean Square Error (RMSE) values obtained.

There are 51 citations in total.

Details

Primary Language: Turkish
Subjects: Tourism (Other)
Journal Section: Original Scientific Article
Authors

Şevhat Doger 0000-0001-9174-159X

Avşar Kurgun 0000-0002-2092-5292

Publication Date: June 21, 2021
Submission Date: May 28, 2021
Published in Issue: Year 2021, Volume: 5, Issue: 1

Cite

APA Doger, Ş., & Kurgun, A. (2021). ŞARAP ÜRETİMİNDE VERİ KALİTESİNE İLİŞKİN EKSİK VERİ SORUNLARININ DERİN ÖĞRENME İLE ÇÖZÜLMESİ: ÜRETİCİ ÇEKİŞMECİ AĞLARLA BİR UYGULAMA. International Journal of Contemporary Tourism Research, 5(1), 99-111. https://doi.org/10.30625/ijctr.943818