Research Article

Comparison of Different Classification Algorithms for Extraction Information from Invoice Images Using an N-Gram Approach

Issue: 31, 31 December 2021

Abstract

Artificial intelligence (AI) is now used in many fields, including accounting. Accounting firms can struggle to keep up with the high invoice volumes of large companies, which creates a need for AI-powered invoice processing. The goal of this work is to determine the best machine learning model for extracting fields such as invoice number, invoice date, due date, delivery date, total gross, total net, VAT amount and IBAN from invoice image files. Text obtained with the Tesseract Optical Character Recognition (OCR) system is converted into n-grams. Several attributes are computed for each n-gram: its coordinates, length, width, line number and template information, and the Levenshtein and Jaro-Winkler distances between the candidate n-gram and the keywords in a control keyword list. Using the Levenshtein distance between candidate n-grams and the control keywords yielded a sufficiently high prediction rate. The most appropriate model and features were selected for training. Random Forest, Gradient Boosting Machine, Extreme Gradient Boosting, K-Nearest Neighbors, AdaBoost and Decision Tree were compared as prediction models. A total of 9910 invoices were used, split 80% for training and 20% for testing. The Random Forest model using the Levenshtein distance performed best, with an average F1 score of 0.9137.
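The n-gram and Levenshtein steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The token list stands in for Tesseract OCR output, and the maximum n-gram size, the keyword list `CONTROL_KEYWORDS`, and the helper names `word_ngrams` and `min_keyword_distance` are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Sequence, Tuple


def word_ngrams(tokens: Sequence[str], max_n: int = 3) -> List[Tuple[str, ...]]:
    """Return all contiguous word n-grams of length 1..max_n (max_n is an assumption)."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def min_keyword_distance(gram: Sequence[str], keywords: Sequence[str]) -> int:
    """Smallest Levenshtein distance between an n-gram and any control keyword."""
    text = " ".join(gram).lower()
    return min(levenshtein(text, kw.lower()) for kw in keywords)


# Hypothetical control keywords; the paper's actual list is not given in the abstract.
CONTROL_KEYWORDS = ["invoice no", "due date", "total net"]

# Hypothetical OCR tokens with a typical recognition error ("Invoce").
ocr_tokens = ["Invoce", "No", "12345"]

# One distance feature per candidate n-gram, usable as a classifier input.
features = {gram: min_keyword_distance(gram, CONTROL_KEYWORDS)
            for gram in word_ngrams(ocr_tokens, max_n=2)}
```

In a setup like this, the distance would be one column in the feature table alongside the positional attributes (coordinates, line number and so on), so a misread label such as "Invoce No" still scores close to its keyword.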

Details

Primary Language

English

Subjects

Engineering

Section

Research Article

Publication Date

31 December 2021

Submission Date

24 December 2020

Acceptance Date

6 October 2021

Published in Issue

Year 2021, Issue: 31

Cite

APA
Nasiboglu, R., & Akdoğan, A. (2021). Comparison of Different Classification Algorithms for Extraction Information from Invoice Images Using an N-Gram Approach. Avrupa Bilim ve Teknoloji Dergisi, 31, 991-1003. https://doi.org/10.31590/ejosat.844862