DocDig: Content-Based Figure Search in Digitized Documents (DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama)

Süleyman Eken; Burak Atay; Büşra Ceren Sönmez; Ahmet Sayar

doi:10.29130/dubited.330094

Research Article

Year 2018, Volume: 6, Issue: 1, pp. 68-78, 31.01.2018

Cited By: 3

Abstract

Pattern recognition is used in many fields, from psychology to biometrics, from bioinformatics to gene expression analysis, and from traffic to computational finance. Optical character recognition (OCR) is one of these fields. Many public and private organizations scan the filed records in their archives to digitize them, which takes labor-intensive work. However, because these records are digitized as images, searching and processing them by content is only partially possible, and only when operators manually add metadata to the scanned image data. In this study, we developed an architecture that enables content-based figure search over large volumes of documents that have been scanned and digitized as images. By searching with keywords, the user can view the relevant figures in the digital documents together with their captions. The feasibility and performance of the system were tested on different data sets, and successful results were obtained.

Keywords

Document digitization, figure/picture detection, caption detection, content-based search

DocDig: Content Based Figure Search in Digitized Documents

Abstract

Pattern recognition is used in many areas, from psychology to biometrics, from bioinformatics to the analysis of gene expressions, and from traffic to computational finance. Optical Character Recognition is also one of these areas. Many public and private firms digitize their archived data and carry out labor-intensive work for this purpose. However, the retrieval and processing of these data, which are digitized as images, is only partially realized, by manually adding metadata to the scanned image data. In this work, we developed an architecture that makes content-based figure searches possible on large quantities of these scanned documents. The user can search with some keywords and display related figures in digital documents with their captions. The feasibility and performance of the system have been tested on different data sets, and successful results have been obtained.

Keywords

Document digitization, Figure/picture detection, Caption detection, Content-based search, MongoDB
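
The keyword search over figure captions described in the abstract can be illustrated with a minimal sketch. It assumes that figures and their captions have already been extracted from the scanned pages. The `Figure` record, the `CaptionIndex` class, and the sample captions are hypothetical names introduced here for illustration, and the sketch uses a plain in-memory inverted index rather than the storage layer of the actual system, which is not described on this page.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Figure:
    """A detected figure and its recovered caption (hypothetical record)."""
    doc_id: str
    page: int
    caption: str


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class CaptionIndex:
    """Inverted index from caption words to the figures that contain them."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> set of figure ids
        self._figures = []                 # figure id -> Figure

    def add(self, figure):
        """Store a figure and index every word of its caption."""
        fig_id = len(self._figures)
        self._figures.append(figure)
        for token in tokenize(figure.caption):
            self._postings[token].add(fig_id)

    def search(self, query):
        """Return figures whose captions contain every query keyword."""
        tokens = tokenize(query)
        if not tokens:
            return []
        ids = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return [self._figures[i] for i in sorted(ids)]


# Illustrative data only; these captions are made up for the example.
index = CaptionIndex()
index.add(Figure("doc-1", 3, "Figure 2. System architecture of the search pipeline"))
index.add(Figure("doc-2", 7, "Figure 5. Query response time by collection size"))
hits = index.search("architecture")
```

Requiring every keyword to match (an AND query) keeps the sketch short; ranking by the number of matched keywords would be a natural variation.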

Details

Primary Language	Turkish
Subjects	Engineering
Section	Articles
Authors	Süleyman Eken (ORCID: 0000-0001-9488-908X), Burak Atay, Büşra Ceren Sönmez, Ahmet Sayar
Publication Date	31 January 2018
Published in Issue	Year 2018, Volume: 6, Issue: 1

Cite

APA	Eken, S., Atay, B., Sönmez, B. C., Sayar, A. (2018). DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama. Duzce University Journal of Science and Technology, 6(1), 68-78. https://doi.org/10.29130/dubited.330094
AMA	Eken S, Atay B, Sönmez BC, Sayar A. DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama. DÜBİTED. January 2018;6(1):68-78. doi:10.29130/dubited.330094
Chicago	Eken, Süleyman, Burak Atay, Büşra Ceren Sönmez, and Ahmet Sayar. “DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama”. Duzce University Journal of Science and Technology 6, no. 1 (January 2018): 68-78. https://doi.org/10.29130/dubited.330094.
EndNote	Eken S, Atay B, Sönmez BC, Sayar A (01 January 2018) DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama. Duzce University Journal of Science and Technology 6 1 68–78.
IEEE	S. Eken, B. Atay, B. C. Sönmez, and A. Sayar, “DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama”, DÜBİTED, vol. 6, no. 1, pp. 68–78, 2018, doi: 10.29130/dubited.330094.
ISNAD	Eken, Süleyman et al. “DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama”. Duzce University Journal of Science and Technology 6/1 (January 2018), 68-78. https://doi.org/10.29130/dubited.330094.
JAMA	Eken S, Atay B, Sönmez BC, Sayar A. DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama. DÜBİTED. 2018;6:68–78.
MLA	Eken, Süleyman et al. “DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama”. Duzce University Journal of Science and Technology, vol. 6, no. 1, 2018, pp. 68-78, doi:10.29130/dubited.330094.
Vancouver	Eken S, Atay B, Sönmez BC, Sayar A. DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama. DÜBİTED. 2018;6(1):68-78.

Duzce University Journal of Science and Technology (Düzce Üniversitesi Bilim ve Teknoloji Dergisi)

Cited By

Searchable Turkish OCRed historical newspaper collection 1928–1942

Journal of Information Science

Houssem Menhour

https://doi.org/10.1177/01655515211000642

Figure search by text in large scale digital document collections

Concurrency and Computation: Practice and Experience

M. Mücahit Enes Yurtsever

https://doi.org/10.1002/cpe.6529

Multi-Class Document Image Classification using Deep Visual and Textual Features

International Journal of Computational Intelligence and Applications

https://doi.org/10.1142/S1469026822500134