Osmanlıcadan Türkçeye Uçtan Uca Aktarım

İshak Dölek; Atakan Kurt

Research Article


Year 2022, Volume: 3 Issue: 1, 1 - 10, 29.06.2022

Abstract

This paper presents a project titled End-to-End Conversion of Ottoman Documents to Modern Turkish. State archives, libraries, and private collections hold millions of Ottoman documents. Converting them to Modern Turkish by hand is not feasible. In this project, available at Osmanlica.com, Ottoman documents are converted to Turkish in three steps: (i) Ottoman optical character recognition (OCR), (ii) Ottoman-Turkish transliteration, and (iii) Ottoman-Turkish translation. To our knowledge, this is the first project that aims to solve all three steps of the Ottoman-Turkish conversion process. Each of these steps is a technically and scientifically complex, resource-intensive problem in NLP and deep learning. In the first step, OCR converts document images to plain text in the Ottoman alphabet. In the second step, a transliteration system converts this text from the Arabic-based Ottoman alphabet to the Latin-based Turkish alphabet. Although the text in the Turkish alphabet is readable, it is not yet understandable because it contains many Arabic and Persian words and structures. In the third step, this text is translated to Modern Turkish by machine translation. The CRNN-based OCR model developed in the first step was tested on a 21-page data set and achieved 96% character recognition accuracy. The transliteration system developed in the second step was tested on a 7500-word data set and achieved 98% word transliteration accuracy. For the third step, a phrase-based machine translation system has been developed and its testing has begun. We consider this project valuable because it contributes to an important social, cultural, and scientific problem.
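The three-step flow described in the abstract can be sketched as a simple composition of stages. This is only an illustrative outline, not the project's implementation: the names `convert`, `ConversionResult`, and the `toy_*` functions are hypothetical, and the toy stages are one-word lookup tables standing in for the CRNN-based OCR model, the transliteration system, and the machine translation system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ConversionResult:
    """Intermediate and final outputs of the three steps."""
    ottoman_text: str  # step (i): OCR output in the Ottoman alphabet
    latin_text: str    # step (ii): the same text in the Latin-based Turkish alphabet
    modern_text: str   # step (iii): Modern Turkish translation


def convert(
    image: bytes,
    ocr: Callable[[bytes], str],
    transliterate: Callable[[str], str],
    translate: Callable[[str], str],
) -> ConversionResult:
    """Run the three steps in order, keeping every intermediate output."""
    ottoman_text = ocr(image)
    latin_text = transliterate(ottoman_text)
    modern_text = translate(latin_text)
    return ConversionResult(ottoman_text, latin_text, modern_text)


# Toy stand-ins (assumptions for illustration only, not real models).
TOY_TRANSLITERATION = {"كتاب": "kitab"}
TOY_TRANSLATION = {"kitab": "kitap"}


def toy_ocr(image: bytes) -> str:
    # A real OCR model would read the image; this stub returns a fixed word.
    return "كتاب"


def toy_transliterate(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(TOY_TRANSLITERATION.get(word, word) for word in text.split())


def toy_translate(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(TOY_TRANSLATION.get(word, word) for word in text.split())


result = convert(b"", toy_ocr, toy_transliterate, toy_translate)
print(result.ottoman_text, "->", result.latin_text, "->", result.modern_text)
```

Passing the stages in as callables keeps each step independently replaceable and testable, which matches the abstract's description of three separate problems.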

Keywords

Ottoman OCR, Ottoman-Turkish alphabet conversion, Ottoman-Turkish transliteration, Ottoman-Turkish language translation

References

[1] M. Ergin, “Osmanlıca Dersleri”, İstanbul: Boğaziçi Yayınları, 2020.
[2] S. Kirmizialtin and D. Wrisley, “Automated Transcription of Non-Latin Script Periodicals: A Case Study in the Ottoman Turkish Print Archive”, arXiv preprint arXiv:2011.01139, 2020. https://arxiv.org/abs/2011.01139
[3] M. Mohd, F. Qamar, I. Al-Sheikh and R. Salah, “Quranic Optical Text Recognition Using Deep Learning Models”, IEEE Access, vol. 9, pp. 38318-38330, 2021, doi: 10.1109/ACCESS.2021.3064019.
[4] Miletos OCR [Online]. Available: www.miletos.com
[5] IRCICA [Online]. Available: library.ircica.org
[6] I. Dolek and A. Kurt, “Ottoman OCR: Printed Naskh Font”, 2021 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), 2021, pp. 1-5, doi: 10.1109/INISTA52262.2021.9548616.
[7] A. Kurt and E. F. Bilgin, “The Outline of an Ottoman-to-Turkish Automatic Machine Transliteration System”, First Workshop on Language Resources and Technologies for Turkic Languages, 2012.
[8] J. Korkut, “Morphology and Lexicon-Based Machine Translation of Ottoman Turkish to Modern Turkish”, Princeton University, Princeton, NJ, USA, 2019.
[9] A. A. Jaf and S. K. Kayhan, “Machine-Based Transliterate of Ottoman to Latin-Based Script”, Scientific Programming, vol. 2021, Article ID 7152935, 8 pages, 2021. https://doi.org/10.1155/2021/7152935
[10] E. Özkan and G. Ercan, “Modernization of old Turkish texts”, 26th Signal Processing and Communications Applications Conference (SIU), 2018, pp. 1-4, doi: 10.1109/SIU.2018.8404308.
[11] J. Memon, M. Sami, R. A. Khan and M. Uddin, “Handwritten Optical Character Recognition (OCR): A Comprehensive Systematic Literature Review (SLR)”, IEEE Access, vol. 8, pp. 142642-142668, 2020, doi: 10.1109/ACCESS.2020.3012542.

End-To-End Conversion Ottoman to Turkish


Abstract

This paper presents a project titled End-to-End Conversion of Ottoman Documents to Contemporary Turkish. State archives, libraries, and private collections contain millions of documents written in Ottoman. It is practically impossible to convert all of these documents to Modern Turkish manually. In this project, which is available at Osmanlica.com, Ottoman documents are converted to Modern Turkish in three steps: (i) Ottoman OCR (Optical Character Recognition), (ii) Ottoman-Turkish transliteration, and (iii) Ottoman-Turkish translation. To our knowledge, this is the only project to date that sets out to solve all three steps of this conversion. Each of these three steps is a technically complex and resource-demanding problem in NLP and deep learning. In the first step, OCR converts image files to editable text in the Ottoman alphabet. In the second step, transliteration transforms that text from the Arabic-based Ottoman alphabet to the Latin-based Turkish alphabet, making it readable but not yet understandable because of its Arabic and Persian words and structures. In the last step, this Ottoman text in the Turkish alphabet is translated to Modern Turkish via machine translation. The CRNN-based model developed in the first step achieved 96% OCR accuracy on a 21-page test document set. The Ottoman-Turkish transliteration system developed in the second step yielded 98% accuracy on a test set of 7500 words. The phrase-based Ottoman-Turkish machine translation system developed in the third step is currently being tested. We believe that the contribution of this project is significant because it addresses an important social, cultural, and scientific problem.
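The abstract reports a character-level accuracy for OCR and a word-level accuracy for transliteration but does not say how either was computed. The sketch below shows one common way to define such metrics, assuming character accuracy is one minus the normalized Levenshtein distance and word accuracy is the fraction of exactly matching words. The function names are hypothetical, and these definitions are assumptions rather than the authors' evaluation procedure.

```python
from typing import Sequence


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))


def word_accuracy(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Assumed definition: share of positions where the words match exactly."""
    if not reference or len(reference) != len(hypothesis):
        raise ValueError("word lists must be non-empty and of equal length")
    correct = sum(ref == hyp for ref, hyp in zip(reference, hypothesis))
    return correct / len(reference)


# Toy data (made up for illustration, not taken from the paper's test sets).
print(character_accuracy("kitap", "kitab"))
print(word_accuracy(["bir", "kitap", "ve", "kalem"], ["bir", "kitab", "ve", "kalem"]))
```

Under these definitions, a single wrong character lowers character accuracy only slightly but makes the whole word count as wrong, so the two figures in the abstract are not directly comparable.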

Keywords

Ottoman OCR, Ottoman-Turkish Transliteration, Ottoman-Turkish Translation


There are 11 citations in total.

Details

Primary Language	Turkish
Subjects	Artificial Intelligence
Journal Section	Research Articles
Authors	İshak Dölek 0000-0002-5823-0103 Atakan Kurt
Publication Date	June 29, 2022
Published in Issue	Year 2022 Volume: 3 Issue: 1

Cite

APA	Dölek, İ., & Kurt, A. (2022). Osmanlıcadan Türkçeye Uçtan Uca Aktarım. Journal of Smart Systems Research, 3(1), 1-10.
AMA	Dölek İ, Kurt A. Osmanlıcadan Türkçeye Uçtan Uca Aktarım. JoinSSR. June 2022;3(1):1-10.
Chicago	Dölek, İshak, and Atakan Kurt. “Osmanlıcadan Türkçeye Uçtan Uca Aktarım”. Journal of Smart Systems Research 3, no. 1 (June 2022): 1-10.
EndNote	Dölek İ, Kurt A (June 1, 2022) Osmanlıcadan Türkçeye Uçtan Uca Aktarım. Journal of Smart Systems Research 3 1 1–10.
IEEE	İ. Dölek and A. Kurt, “Osmanlıcadan Türkçeye Uçtan Uca Aktarım”, JoinSSR, vol. 3, no. 1, pp. 1–10, 2022.
ISNAD	Dölek, İshak - Kurt, Atakan. “Osmanlıcadan Türkçeye Uçtan Uca Aktarım”. Journal of Smart Systems Research 3/1 (June2022), 1-10.
JAMA	Dölek İ, Kurt A. Osmanlıcadan Türkçeye Uçtan Uca Aktarım. JoinSSR. 2022;3:1–10.
MLA	Dölek, İshak and Atakan Kurt. “Osmanlıcadan Türkçeye Uçtan Uca Aktarım”. Journal of Smart Systems Research, vol. 3, no. 1, 2022, pp. 1-10.
Vancouver	Dölek İ, Kurt A. Osmanlıcadan Türkçeye Uçtan Uca Aktarım. JoinSSR. 2022;3(1):1-10.
