Research Article

DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama

Year 2018, Volume: 6 Issue: 1, 68 - 78, 31.01.2018
https://doi.org/10.29130/dubited.330094

Abstract

Pattern recognition is used in many fields, from psychology to biometrics, from bioinformatics to gene expression analysis, and from traffic to computational finance. Optical Character Recognition is one of these fields. Many public and private organizations scan the filed records in their archives to digitize them, and they carry out labor-intensive work to do so. However, searching and processing these image-based records by content is only partially possible, and only when operators manually add metadata to the scanned images. In this study, we developed an architecture that enables content-based figure search over large volumes of documents that have been scanned as images and digitized. By searching with keywords, the user can view the relevant figures in the digital documents together with their captions. The feasibility and performance of the system were tested on different data sets, and successful results were obtained.


DocDig: Content Based Figure Search in Digitized Documents


Abstract

Pattern recognition is used in many areas, from psychology to biometrics, from bioinformatics to the analysis of gene expressions, and from traffic to computational finance. Optical Character Recognition is one of these areas. Many public and private organizations scan their archived records to digitize them, which takes labor-intensive work. However, retrieving and processing these data by content, once they are digitized as images, is only partially possible, and only when operators manually add metadata to the scanned images. In this work, we developed an architecture that makes content-based figure search possible over large quantities of such scanned documents. The user can search with keywords and display the related figures in the digital documents together with their captions. The feasibility and performance of the system have been tested on different data sets, and successful results have been obtained.
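The keyword-to-figure lookup described in the abstract can be illustrated with a minimal sketch. It is not the architecture from the paper: it assumes that figures have already been cropped from the scanned pages and that their captions have already been recognized by an OCR step, and it indexes only the caption text. The names `Figure`, `CaptionIndex`, and `tokenize`, as well as the sample captions, are hypothetical and chosen for illustration.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Figure:
    """Hypothetical record: a figure from a scanned page plus its OCR'd caption."""
    document: str
    page: int
    caption: str


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class CaptionIndex:
    """Inverted index from caption words to the figures that contain them."""

    def __init__(self) -> None:
        self._postings: dict[str, set[Figure]] = defaultdict(set)

    def add(self, figure: Figure) -> None:
        """Register every word of the figure's caption in the index."""
        for token in tokenize(figure.caption):
            self._postings[token].add(figure)

    def search(self, query: str) -> list[Figure]:
        """Return figures whose captions contain every keyword in the query."""
        tokens = tokenize(query)
        if not tokens:
            return []
        # Intersect the posting sets so that all keywords must match.
        matches = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(matches, key=lambda f: (f.document, f.page))


# Usage with made-up captions (assumed sample data, not from the paper).
index = CaptionIndex()
index.add(Figure("report.pdf", 3, "Figure 2. System architecture of the OCR pipeline"))
index.add(Figure("report.pdf", 7, "Figure 5. Search latency by data set size"))
hits = index.search("architecture ocr")
for hit in hits:
    print(hit.document, hit.page, hit.caption)
```

Requiring every keyword to match keeps the result list short; a system working over noisy OCR output might instead rank partial matches, which this sketch does not attempt.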


Details

Primary Language Turkish
Subjects Engineering
Section Articles
Authors

Süleyman Eken 0000-0001-9488-908X

Burak Atay

Büşra Ceren Sönmez

Ahmet Sayar

Publication Date January 31, 2018
Published in Issue Year 2018 Volume: 6 Issue: 1

Cite

APA Eken, S., Atay, B., Sönmez, B. C., Sayar, A. (2018). DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama. Duzce University Journal of Science and Technology, 6(1), 68-78. https://doi.org/10.29130/dubited.330094
AMA Eken S, Atay B, Sönmez BC, Sayar A. DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama. DÜBİTED. January 2018;6(1):68-78. doi:10.29130/dubited.330094
Chicago Eken, Süleyman, Burak Atay, Büşra Ceren Sönmez, and Ahmet Sayar. “DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama”. Duzce University Journal of Science and Technology 6, no. 1 (January 2018): 68-78. https://doi.org/10.29130/dubited.330094.
EndNote Eken S, Atay B, Sönmez BC, Sayar A (01 January 2018) DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama. Duzce University Journal of Science and Technology 6 1 68–78.
IEEE S. Eken, B. Atay, B. C. Sönmez, and A. Sayar, “DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama”, DÜBİTED, vol. 6, no. 1, pp. 68–78, 2018, doi: 10.29130/dubited.330094.
ISNAD Eken, Süleyman et al. “DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama”. Duzce University Journal of Science and Technology 6/1 (January 2018), 68-78. https://doi.org/10.29130/dubited.330094.
JAMA Eken S, Atay B, Sönmez BC, Sayar A. DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama. DÜBİTED. 2018;6:68–78.
MLA Eken, Süleyman et al. “DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama”. Duzce University Journal of Science and Technology, vol. 6, no. 1, 2018, pp. 68-78, doi:10.29130/dubited.330094.
Vancouver Eken S, Atay B, Sönmez BC, Sayar A. DocDig: Dijitalleştirilmiş Dokümanlarda İçerik Tabanlı Figür Arama. DÜBİTED. 2018;6(1):68-7.