Automatic Code Assignment with the Code Assignment System (KASIS)

Levent Ahi; Ebru Kılıç Çakmak

Research Article

Year 2020, Volume: 2, Issue: 1, 73-87, 22.06.2020

Abstract

This study introduces two code assignment methods used in the Code Assignment System (KASIS), which was developed to code uncoded records more consistently, reliably and systematically, and to analyse the accuracy of records coded by interviewers. Code assignment consists of converting a textual description into the most appropriate code in a standardised classification dictionary. The Turkish Statistical Institute (TURKSTAT) conducts the Household Budget Survey (HBS) annually. First, the system checked the accuracy of the codes that interviewers had assigned manually, using the consumption expenditure data sets of the HBS conducted in 2016-2018. Then, the system reassigned codes to the records classified as inadequate or suspicious, using two different methods. The first method assigns codes to these records using a narrowed list from the classification dictionary. The second assigns codes using the full list in the classification dictionary directly. Both methods use fuzzy matching techniques, which rely on algorithms developed to measure the similarity of two texts. The study evaluates both the accuracy of interviewer coding and the effectiveness of the two methods. It concludes that the first method, applied to the narrowed list, gives better results than the second, and that automatic coding methods give effective results.

Keywords

Statistical Classification, COICOP, HBS, Automatic Coding, Fuzzy Matching
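
The two methods described in the abstract can be illustrated with a minimal sketch. It is not the KASIS implementation. The abstract does not say which similarity algorithm is used or how the list is narrowed, so the sketch assumes the standard-library `difflib.SequenceMatcher` ratio as the similarity measure and assumes that narrowing keeps the dictionary entries sharing a prefix with the interviewer's code. The dictionary contents and the names `assign_code`, `assign_narrowed` and `assign_full` are hypothetical.

```python
from difflib import SequenceMatcher

# Toy classification dictionary: hypothetical codes and descriptions,
# not the actual TURKSTAT classification dictionary.
DICTIONARY = {
    "01.1.1": "bread and cereals",
    "01.1.2": "meat",
    "01.1.4": "milk, cheese and eggs",
    "01.2.1": "coffee, tea and cocoa",
    "03.1.2": "garments",
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] between two texts."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def assign_code(text, candidates):
    """Return (code, score) for the candidate description most similar to text."""
    return max(
        ((code, similarity(text, desc)) for code, desc in candidates.items()),
        key=lambda pair: pair[1],
    )


def assign_narrowed(text, interviewer_code, dictionary, prefix_len=4):
    """Method 1 (assumed narrowing rule): match only against entries that
    share the first prefix_len characters of the interviewer's code.
    Falls back to the full dictionary if the narrowed list is empty."""
    prefix = interviewer_code[:prefix_len]
    narrowed = {c: d for c, d in dictionary.items() if c.startswith(prefix)}
    return assign_code(text, narrowed or dictionary)


def assign_full(text, dictionary):
    """Method 2: match directly against the full dictionary."""
    return assign_code(text, dictionary)
```

For a record described as "coffee" with a suspicious interviewer code "01.2.9", `assign_narrowed` compares the text only with entries under "01.2", whereas `assign_full` compares it with every entry. Restricting the candidates reduces the chance that a superficially similar description from an unrelated group wins the match.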





Details

Primary Language	Turkish
Subjects	Engineering, Computer Software
Section	Research Articles
Authors	Levent Ahi 0000-0002-7415-1173; Ebru Kılıç Çakmak 0000-0002-3459-6290
Publication Date	22 June 2020
Submission Date	10 May 2020
Acceptance Date	21 June 2020
Published in Issue	Year 2020, Volume: 2, Issue: 1

Cite

APA	Ahi, L., & Kılıç Çakmak, E. (2020). Kod Atama Sistemi (KASİS) ile Otomatik Kod Atama. Bilgi ve İletişim Teknolojileri Dergisi, 2(1), 73-87.


Bilgi ve İletişim Teknolojileri Dergisi (BİTED)

Journal of Information and Communication Technologies