Research Article

Code Assignment System (KASIS) for International Statistical Classifications

Year 2020, Volume: 13 Issue: 3, 313 - 327, 31.07.2020
https://doi.org/10.17671/gazibtd.588097

Abstract

Statistical classifications play an important role in national statistical systems. A statistical classification is applied through code assignment: a textual definition is matched with a definition in the standard classification dictionary, and the dictionary code corresponding to that definition is used. In questionnaires, textual definitions are often used to classify variables into the right groups. Classifying the variables correctly ensures that the results of studies based on them are correct. As the number of records increases, manual methods are no longer sufficient to check that variables are classified in the correct groups, so an automated system that can perform this check is needed. This study introduces a system that automatically checks whether variables that use a classification have been assigned to the correct group. The effectiveness of the system was tested using the 2017 Household Budget Survey (HBS) microdata set compiled by the Turkish Statistical Institute (TURKSTAT). This data set is the main source of consumption expenditure statistics in Turkey. Consumption expenditures are classified using the Classification of Individual Consumption by Purpose (COICOP). The codes assigned by the interviewers were checked with the developed system, and the results were examined. The developed system differs from systems based on supervised machine learning in that it does not need a training data set. It can start working from scratch and continues to learn from each additional record. The system can also help supervised machine learning systems learn correctly by checking whether the records in their training data sets have been classified correctly.
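The checking idea described in the abstract can be illustrated with a minimal Python sketch. It is not the KASIS algorithm, which the abstract does not specify. The word-overlap score, the 0.5 threshold, the stop-word list, the class and function names, and the three sample dictionary entries are all assumptions made for illustration only.

```python
from __future__ import annotations

import re

# Assumption: a tiny illustrative stop-word list, not part of the source.
STOPWORDS = {"and", "of", "the"}


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its word tokens, minus stop words."""
    return set(re.findall(r"\w+", text.lower())) - STOPWORDS


class CodeChecker:
    """Hypothetical checker that starts from the dictionary alone (no training set)."""

    def __init__(self, dictionary: dict[str, str]) -> None:
        # One vocabulary per code, seeded with the dictionary definition.
        self.vocab = {code: tokens(label) for code, label in dictionary.items()}

    def suggest(self, description: str) -> tuple[str, float]:
        """Return the best-matching code and the share of matched words."""
        words = tokens(description)
        if not words or not self.vocab:
            return "", 0.0
        scores = {
            code: len(words & vocab) / len(words)
            for code, vocab in self.vocab.items()
        }
        best = max(scores, key=scores.get)
        return best, scores[best]

    def check(self, description: str, assigned_code: str,
              threshold: float = 0.5) -> str:
        """Compare the interviewer's code with the best dictionary match."""
        best, score = self.suggest(description)
        if score < threshold:
            return "review"      # too little evidence to judge
        if best != assigned_code:
            return "mismatch"    # evidence points to a different code
        # Confirmed record: learn its words for future checks.
        self.vocab[assigned_code] |= tokens(description)
        return "ok"


# Illustrative dictionary entries in the style of COICOP classes.
checker = CodeChecker({
    "01.1.1": "Bread and cereals",
    "01.1.2": "Meat",
    "01.1.4": "Milk, cheese and eggs",
})
print(checker.check("rye loaf", "01.1.1"))     # review (no known word yet)
print(checker.check("rye bread", "01.1.1"))    # ok, and "rye" is learned
print(checker.check("rye loaf", "01.1.1"))     # ok (matched via learned "rye")
print(checker.check("minced meat", "01.1.4"))  # mismatch (best match is 01.1.2)
```

In this sketch, the third call succeeds only because the second call added "rye" to the vocabulary of code 01.1.1, which mirrors the idea of a system that starts without training data and learns from each additional record.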

Uluslararası İstatistiksel Sınıflamalara Yönelik Kod Atama Sistemi (KASİS)

Abstract

Statistical classifications occupy a very important place in the statistical systems of countries. Using a statistical classification depends on code assignment. Code assignment consists of matching the textual definition at hand with the definition in the standard classification dictionary and using the dictionary code that corresponds to this definition. In surveys, textual definitions are frequently used so that variables can be classified into the correct groups. Classifying the variables correctly ensures that the results of research carried out with these variables are correct. As the number of records grows, manual methods will not be sufficient to check that variables have been classified into the correct groups. An automated system that can carry out this process is therefore needed. This study introduces a system that can automatically check whether variables that use a classification have been classified into the correct group. The effectiveness of the system was evaluated using the 2017 data set of the Household Budget Survey (HBS) conducted by the Turkish Statistical Institute (TURKSTAT). This data set is the main source of consumption expenditure statistics in Turkey. The Classification of Individual Consumption by Purpose (COICOP) is used to classify consumption expenditures. Records coded by the interviewer were checked with the developed system, and the results were examined. The developed system differs from systems that use supervised machine learning methods in that it does not need a training data set. The system can start working from scratch and continues by increasing its own learning with each additional record. By checking whether the records in a training data set have been classified correctly, the system can also help systems that use supervised machine learning to learn correctly.

There are 39 citations in total.

Details

Primary Language: Turkish
Subjects: Computer Software
Journal Section: Articles
Authors

Levent Ahi 0000-0002-7415-1173

Ebru Kılıç Çakmak 0000-0002-3459-6290

Publication Date: July 31, 2020
Submission Date: July 7, 2019
Published in Issue: Year 2020, Volume: 13, Issue: 3

Cite

APA Ahi, L., & Kılıç Çakmak, E. (2020). Uluslararası İstatistiksel Sınıflamalara Yönelik Kod Atama Sistemi (KASİS). Bilişim Teknolojileri Dergisi, 13(3), 313-327. https://doi.org/10.17671/gazibtd.588097