Research Article

Comparison of Different Classification Algorithms for Extraction Information from Invoice Images Using an N-Gram Approach

Number: 31, December 31, 2021

Abstract

Artificial intelligence (AI) is now used in many fields, and accounting is one of them. Accounting firms can struggle to keep up with demand, especially when large companies generate high volumes of invoices. This problem created the need for an AI-powered system that processes invoices. The goal of this work is to determine the best machine learning model for extracting the invoice number, invoice date, due date, delivery date, total gross, total net, VAT amount and IBAN from invoice image files. The text obtained by the Tesseract Optical Character Recognition (OCR) system was converted into n-gram format. A number of attributes were calculated for each n-gram: its coordinates, length, width, line number and template information, as well as the Levenshtein and Jaro-Winkler distances between the candidate n-gram and the keywords in a control keyword list. Using the Levenshtein distance between candidate n-grams and the control keywords gave a sufficiently high prediction rate. The most appropriate model and features for training were then determined. Random Forest, Gradient Boosting Machine, Extreme Gradient Boosting, K-Nearest Neighbors, AdaBoost and Decision Tree were compared as prediction models. A total of 9910 invoices were used, split 80% for training and 20% for testing. The Random Forest model using the Levenshtein distance was the best model, with an average F1 score of 0.9137.
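
To make the feature-extraction step in the abstract more concrete, the following is a minimal Python sketch of how word n-grams could be built from OCR tokens and scored against a control keyword list with the Levenshtein distance. It is not the authors' implementation. The function names, the sample tokens, the keyword list and the choice of word-level n-grams up to length 2 are all assumptions made for illustration.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def word_ngrams(tokens: List[str], max_n: int = 3) -> List[Tuple[int, str]]:
    """Return (start_index, text) for every word n-gram with 1 <= n <= max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append((start, " ".join(tokens[start:start + n])))
    return grams


def min_keyword_distance(ngram: str, keywords: List[str]) -> int:
    """Smallest case-insensitive Levenshtein distance to any control keyword."""
    return min(levenshtein(ngram.lower(), k.lower()) for k in keywords)


# Hypothetical OCR tokens (with a typical misread) and keyword list.
ocr_tokens = ["Invoice", "Nunber", ":", "INV-2020-0042"]
control_keywords = ["invoice number", "invoice date", "due date"]

# One feature row per candidate n-gram: position, text, distance to keywords.
features = [(start, text, min_keyword_distance(text, control_keywords))
            for start, text in word_ngrams(ocr_tokens, max_n=2)]
```

In a pipeline like the one the abstract describes, rows of this kind would be combined with the positional attributes (coordinates, line number and so on) and passed to a classifier such as Random Forest. That step is omitted here.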

Details

Primary Language: English
Subjects: Engineering
Journal Section: Research Article
Publication Date: December 31, 2021
Submission Date: December 24, 2020
Acceptance Date: October 6, 2021
Published in Issue: Year 2021, Number 31

APA
Nasiboglu, R., & Akdoğan, A. (2021). Comparison of Different Classification Algorithms for Extraction Information from Invoice Images Using an N-Gram Approach. Avrupa Bilim Ve Teknoloji Dergisi, 31, 991-1003. https://doi.org/10.31590/ejosat.844862