Classification of Documents Extracted from Images with Optical Character Recognition Methods

Ömer Aydın

Research Article

Year 2021, Volume: 6 Issue: 2, 46 - 55, 01.06.2021

Abstract

Over the past decade, machine learning methods have given us driverless cars, voice recognition, effective web search, and a much better understanding of the human genome. Machine learning is now so common that most people use it dozens of times a day, often without knowing it. Teaching a machine a process or a situation can enable it to predict results that are difficult for the human brain to predict. These methods also help us carry out operations that would be impossible or difficult for people to complete in a short time. For these reasons, machine learning is very important today. In this study, two different machine learning methods were combined. To solve a real-world problem, manuscript documents were first transferred to the computer and then classified. We used three basic methods to carry out the whole process. Handwritten or printed documents were digitized with a scanner or digital camera. These documents were then processed with two different Optical Character Recognition (OCR) operations. The generated texts were then classified using the Naive Bayes algorithm. The whole project was programmed in Microsoft Visual Studio 12 on the Windows operating system. The C# programming language was used for all parts of the study, along with some ready-made code and DLLs.
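
The classification step named in the abstract can be illustrated with a minimal sketch. The abstract does not say which Naive Bayes variant, tokenizer, or categories the study used, so the code below assumes a multinomial model with add-one (Laplace) smoothing, whitespace tokenization, and two hypothetical categories with made-up training texts. It is written in Python for brevity, although the study itself was implemented in C#, and it takes the OCR output as a plain string rather than performing OCR.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train(labeled_docs):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-posterior under multinomial Naive Bayes."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training, e.g. OCR misreads
            count = word_counts[label][word]
            # Add-one smoothing keeps unseen class/word pairs from zeroing the score.
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training texts and categories, for illustration only.
training_docs = [
    ("invoice total amount due payment", "finance"),
    ("payment received bank transfer amount", "finance"),
    ("patient diagnosis treatment hospital", "medical"),
    ("doctor prescribed treatment for patient", "medical"),
]
model = train(training_docs)

# Stand-in for text produced by an OCR step.
ocr_text = "amount of the payment on this invoice"
predicted = classify(ocr_text, model)
print(predicted)
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and skipping out-of-vocabulary tokens is one simple way to tolerate misrecognized characters in OCR output.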

Keywords

Optical Character Recognition, OCR, Classification, Naive Bayes, Machine Learning, Text mining, Image processing

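Turkish title: Optik Karakter Tanıma Yöntemleriyle Görsellerden Elde Edilen Metinlerin Sınıflandırılması (Classification of Texts Obtained from Images Using Optical Character Recognition Methods)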

Details

Primary Language	English
Subjects	Artificial Intelligence
Journal Section	Research Article
Authors	Ömer Aydın (ORCID: 0000-0002-7137-4881)
Publication Date	June 1, 2021
Submission Date	January 19, 2021
Acceptance Date	February 26, 2021
Published in Issue	Year 2021 Volume: 6 Issue: 2

Cite

APA	Aydın, Ö. (2021). Classification of Documents Extracted from Images with Optical Character Recognition Methods. Computer Science, 6(2), 46-55.