Efficient and Accurate Date Extraction from Invoices: A Comprehensive Three-Step Methodology Integrating Custom Object Detection, OCR, and Refined Regular Expressions

Mehmet Hilmi Emel; Murat Terzioğlu; Ramazan Özkan

doi:10.54569/aair.1401234

Research Article

Year 2024, Volume: 4, Issue: 1, pp. 10-17, 30.08.2024

Abstract

In contemporary document processing, extracting key information from invoices with diverse layouts remains challenging. This article presents a three-step methodology for extracting dates from invoices. Using Label Studio, Python, and OpenCV, we build a dataset and train a custom object detection model with Ultralytics YOLOv8. Optical Character Recognition (OCR) then converts the image data into text strings that can be processed further. Finally, regular expressions refine the extracted text to produce precisely formatted dates. The developed system significantly improves time efficiency, marking a noteworthy advance in date extraction from invoices.
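As an illustration of the third step only, the following minimal Python sketch shows how a regular expression might pull dates out of OCR output and normalize them. It is not the authors' implementation: the pattern, the day-month-year ordering, the ISO 8601 output format, the function name `extract_dates`, and the sample string are all assumptions made for this example, since the abstract does not list the formats actually handled.

```python
import re
from datetime import date

# Hypothetical pattern: day, month, and four-digit year separated by '.', '/' or '-'.
# Optional whitespace around separators tolerates spacing that OCR may introduce.
DATE_PATTERN = re.compile(r"\b(\d{1,2})\s*[./-]\s*(\d{1,2})\s*[./-]\s*(\d{4})\b")


def extract_dates(ocr_text: str) -> list[str]:
    """Return every valid day-month-year date in ocr_text as an ISO 8601 string."""
    results = []
    for day, month, year in DATE_PATTERN.findall(ocr_text):
        try:
            parsed = date(int(year), int(month), int(day))
        except ValueError:
            # Skip impossible dates (for example 31.02.2024), which can come from OCR misreads.
            continue
        results.append(parsed.isoformat())
    return results


# Made-up OCR output for demonstration; not taken from the paper's dataset.
sample = "Invoice No: 4471  Date: 30.08.2024  Due: 15/09/2024  Ref 99.99.2024"
print(extract_dates(sample))  # ['2024-08-30', '2024-09-15']
```

Validating each match with `datetime.date` rather than relying on the pattern alone keeps the regular expression short while still rejecting strings that only look like dates.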

Keywords

Custom Object Detection, OCR, Invoice Processing, YOLOv8

References

Open Source Data Labeling | Label Studio. (n.d.). Label Studio. https://labelstud.io/
Ultralytics. (n.d.). GitHub - ultralytics/ultralytics: NEW - YOLOv8 in PyTorch > ONNX > OpenVINO > CoreML > TFLite. GitHub. https://github.com/ultralytics/ultralytics
PaddlePaddle. (n.d.). GitHub - PaddlePaddle/PaddleOCR: Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices). GitHub. https://github.com/PaddlePaddle/PaddleOCR
M. S. Satav, T. Varade, D. Kothavale, S. Thombare and P. Lokhande, "Data Extraction From Invoices Using Computer Vision," 2020 IEEE 15th International Conference on Industrial and Information Systems (ICIIS), Rupnagar, India, 2020, pp. 316-320, doi: 10.1109/ICIIS51140.2020.9342722.
D. A. Kosiba and R. Kasturi, "Automatic invoice interpretation: invoice structure analysis," Proceedings of 13th International Conference on Pattern Recognition, Vienna, Austria, 1996, pp. 721-725 vol.3, doi: 10.1109/ICPR.1996.547263.
H. Sidhwa, S. Kulshrestha, S. Malhotra and S. Virmani, "Text Extraction from Bills and Invoices," 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), Greater Noida, India, 2018, pp. 564-568, doi: 10.1109/ICACCCN.2018.8748309.
Tesseract-Ocr. (n.d.). GitHub - tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository). GitHub. https://github.com/tesseract-ocr/tesseract
R. Smith, "An Overview of the Tesseract OCR Engine," Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Curitiba, Brazil, 2007, pp. 629-633, doi: 10.1109/ICDAR.2007.4376991.
Ar, I., Karsligil, M.E. (2007). Text Area Detection in Digital Documents Images Using Textural Features. In: Kropatsch, W.G., Kampel, M., Hanbury, A. (eds) Computer Analysis of Images and Patterns. CAIP 2007. Lecture Notes in Computer Science, vol 4673. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74272-2_69
Zhang, X., Duan, L., Ma, L., Wu, J. (2017). Text Extraction for Historical Tibetan Document Images Based on Connected Component Analysis and Corner Point Detection. In: Yang, J., et al. Computer Vision. CCCV 2017. Communications in Computer and Information Science, vol 772. Springer, Singapore. https://doi.org/10.1007/978-981-10-7302-1_45
Wang, D. (2011). Analysis of Form Images. International Journal of Pattern Recognition and Artificial Intelligence, 08. https://doi.org/10.1142/S0218001494000528

There are 11 citations in total.

Details

Primary Language	English
Subjects	Image Processing, Modelling and Simulation
Journal Section	Research Articles
Authors	Mehmet Hilmi Emel (ORCID 0009-0003-4236-4590); Murat Terzioğlu (ORCID 0009-0005-2645-2616); Ramazan Özkan (ORCID 0009-0000-9445-9338)
Publication Date	August 30, 2024
Submission Date	December 6, 2023
Acceptance Date	August 30, 2024
Published in Issue	Year 2024 Volume: 4 Issue: 1

Cite

IEEE	M. H. Emel, M. Terzioğlu, and R. Özkan, “Efficient and Accurate Date Extraction from Invoices: A Comprehensive Three-Step Methodology Integrating Custom Object Detection, OCR, and Refined Regular Expressions”, Adv. Artif. Intell. Res., vol. 4, no. 1, pp. 10–17, 2024, doi: 10.54569/aair.1401234.

Advances in Artificial Intelligence Research is an open access journal which means that the content is freely available without charge to the user or his/her institution. All papers are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, which allows users to distribute, remix, adapt, and build upon the material in any medium or format for non-commercial purposes only, and only so long as attribution is given to the creator.
