meuded

Mersin Üniversitesi Dil ve Edebiyat Dergisi

1304-6594 2149-0856

Mersin Üniversitesi

Computational Linguistics Lexicography and Semantics

Hesaplamalı Dilbilim Sözlükbilim ve Anlambilim

Makinece Okunabilir Çok Dilli Sözlükler: Yarı Otomatik Bir ISO–TEI Modeli

Machine-Readable Multilingual Dictionaries: A Semi-Automatic ISO–TEI Model

https://orcid.org/0000-0001-9340-5485

Özcan

Emrah

YILDIZ TECHNICAL UNIVERSITY

https://orcid.org/0000-0002-8278-2805

Erkoç

Mehmet Fatih

YILDIZ TECHNICAL UNIVERSITY

https://orcid.org/0000-0002-9231-4191

Tokatlı

Hasan

YILDIZ TECHNICAL UNIVERSITY

02 27 2026

22 1 1 20 02 03 2026 02 23 2026

2004

Mersin Üniversitesi Dil ve Edebiyat Dergisi


The digital transformation of lexicography has shifted the discipline from producing static, human-readable artefacts to building dynamic, machine-readable databases. This article examines the theoretical and methodological foundations of developing machine-readable multilingual dictionaries (MRDs), with particular emphasis on semi-automatic derivation via pivot languages. Drawing on the experimental results of a research project we conducted, we analyse how effectively the Text Encoding Initiative (TEI) guidelines and ISO 24613 (Lexical Markup Framework, LMF) model complex lexical data. By synthesising historical critiques of database models with contemporary standards for lexical interoperability, we argue that fully automatic induction of multilingual lexicons remains fraught with semantic ambiguity, whereas a semi-automatic workflow grounded in rigorous data modelling and human verification offers a scalable way to overcome the resource scarcity that affects many language pairs.
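To make the idea of semi-automatic derivation via a pivot language concrete, the following is a minimal sketch and not the project's actual pipeline. It assumes two toy bilingual word lists (Turkish to English, English to German) whose entries are invented for illustration. The function names `derive_candidates` and `triage` and the `min_support` threshold are likewise hypothetical. Candidate pairs supported by at least two distinct pivot words are accepted automatically; all others are queued for human verification.

```python
from collections import defaultdict


def derive_candidates(src_to_pivot, pivot_to_tgt):
    """Return {source: {target: sorted list of pivot words supporting the pair}}."""
    candidates = {}
    for src, pivots in src_to_pivot.items():
        support = defaultdict(set)
        for pivot in pivots:
            # Every target translation of the pivot word becomes a candidate.
            for tgt in pivot_to_tgt.get(pivot, ()):
                support[tgt].add(pivot)
        candidates[src] = {tgt: sorted(p) for tgt, p in support.items()}
    return candidates


def triage(candidates, min_support=2):
    """Split candidates into auto-accepted pairs and pairs queued for human review.

    min_support is an illustrative threshold (an assumption of this sketch):
    the number of distinct pivot words that must link source and target.
    """
    accepted, review = [], []
    for src, targets in candidates.items():
        for tgt, pivots in targets.items():
            bucket = accepted if len(pivots) >= min_support else review
            bucket.append((src, tgt, pivots))
    return sorted(accepted), sorted(review)


# Toy, invented data: Turkish -> English (pivot) -> German.
TR_EN = {"banka": ["bank"], "kıyı": ["bank", "shore"]}
EN_DE = {"bank": ["Bank", "Ufer"], "shore": ["Ufer", "Küste"]}

accepted, review = triage(derive_candidates(TR_EN, EN_DE))
print("accepted:", accepted)
print("needs review:", review)
```

In this toy run the polysemous pivot "bank" proposes both the correct pair banka/Bank and the spurious pair banka/Ufer with identical support, so neither can be accepted automatically. This is the kind of semantic ambiguity that motivates keeping a human verification step in the workflow.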

Computational lexicography; Machine-readable dictionaries (MRD); Text Encoding Initiative (TEI); ISO 24613 (LMF); Multilingual dictionaries; Pivot language; Semi-automatic modelling


Yildiz Technical University Scientific Research Projects Coordination Unit

SBA-2021-4256
