Identification of OOV words in Turkish texts

Enis Arslan; Umut Orhan



Abstract

In this study, we present a semantic graph network model capable of detecting out-of-vocabulary (OOV) words in Turkish texts. In natural language processing (NLP), morphological analyzers can encounter unknown words (UW) during word processing. This mostly occurs when such tools depend on a dictionary to find the probable lemmas before parsing can proceed. Sometimes an analyzer cannot find any candidates because the lemma is missing from the dictionary, which degrades the parsing output. The proposed model identifies OOV words that are suitable for inclusion in dictionaries. In addition, the co-occurrence relations of lemmas in texts are modelled as a semantic sub-graph, which is used to discover collocations that are proposed as new lemma candidates.
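The two ideas in the abstract, dictionary lookup failure and co-occurrence-based collocation discovery, can be illustrated with a minimal Python sketch. This is not the authors' model: the function names (`find_oov`, `cooccurrence_edges`, `collocation_candidates`), the toy lexicon, the adjacent-lemma edge definition, and the `min_count` threshold are all assumptions made for illustration only.

```python
from collections import Counter

def find_oov(tokens, lexicon):
    """Return tokens that have no entry in the lexicon, in first-seen order."""
    oov = []
    for token in tokens:
        if token not in lexicon and token not in oov:
            oov.append(token)
    return oov

def cooccurrence_edges(sentences):
    """Count directed edges between adjacent lemmas (an assumed, simplified graph)."""
    edges = Counter()
    for lemmas in sentences:
        for left, right in zip(lemmas, lemmas[1:]):
            edges[(left, right)] += 1
    return edges

def collocation_candidates(edges, min_count=2):
    """Keep edges seen at least min_count times, most frequent first."""
    frequent = [(pair, n) for pair, n in edges.items() if n >= min_count]
    return sorted(frequent, key=lambda item: (-item[1], item[0]))

# Hypothetical toy data: lemma sequences and a tiny dictionary.
sentences = [
    ["cep", "telefon", "al"],
    ["cep", "telefon", "sat"],
    ["selfie", "çek"],
]
lexicon = {"cep", "telefon", "al", "sat", "çek"}

all_tokens = [lemma for sentence in sentences for lemma in sentence]
oov_words = find_oov(all_tokens, lexicon)
candidates = collocation_candidates(cooccurrence_edges(sentences))
```

In this toy setting, `oov_words` holds the single token missing from the lexicon, and `candidates` holds the lemma pair that recurs often enough to be proposed as a collocation. A real system would need a proper lemmatizer and a statistically grounded association measure in place of the raw count threshold.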

Keywords

Unknown words, Collocation, Co-occurrence, OOV words


Details

Primary Language

English

Subjects

Engineering

Section

Research Article

Authors

Enis Arslan *
Türkiye

Umut Orhan
Türkiye

Publication Date

October 31, 2019

Submission Date

December 30, 2018

Acceptance Date

September 26, 2019

Published in Issue

Year 2019, Volume: 8, Issue: 2

IZ

https://izlik.org/JA76LY73PM

Cite


APA

Arslan, E., & Orhan, U. (2019). Identification of OOV words in Turkish texts. Gaziosmanpaşa Bilimsel Araştırma Dergisi, 8(2), 35-48. https://izlik.org/JA76LY73PM

AMA

1. Arslan E, Orhan U. Identification of OOV words in Turkish texts. GBAD. 2019;8(2):35-48. https://izlik.org/JA76LY73PM

Chicago

Arslan, Enis, and Umut Orhan. 2019. “Identification of OOV words in Turkish texts”. Gaziosmanpaşa Bilimsel Araştırma Dergisi 8 (2): 35-48. https://izlik.org/JA76LY73PM.

EndNote

Arslan E, Orhan U (October 1, 2019) Identification of OOV words in Turkish texts. Gaziosmanpaşa Bilimsel Araştırma Dergisi 8 2 35–48.

IEEE

[1] E. Arslan and U. Orhan, “Identification of OOV words in Turkish texts”, GBAD, vol. 8, no. 2, pp. 35–48, Oct. 2019, [Online]. Available: https://izlik.org/JA76LY73PM

ISNAD

Arslan, Enis - Orhan, Umut. “Identification of OOV words in Turkish texts”. Gaziosmanpaşa Bilimsel Araştırma Dergisi 8/2 (1 October 2019): 35-48. https://izlik.org/JA76LY73PM.

JAMA

1. Arslan E, Orhan U. Identification of OOV words in Turkish texts. GBAD. 2019;8:35–48.

MLA

Arslan, Enis, and Umut Orhan. “Identification of OOV words in Turkish texts”. Gaziosmanpaşa Bilimsel Araştırma Dergisi, vol. 8, no. 2, Oct. 2019, pp. 35-48, https://izlik.org/JA76LY73PM.

Vancouver

1. Enis Arslan, Umut Orhan. Identification of OOV words in Turkish texts. GBAD [Internet]. 1 October 2019;8(2):35-48. Available from: https://izlik.org/JA76LY73PM

Turkish title: Türkçe metinlerde sözlük dışı kelime tespiti (Out-of-vocabulary word detection in Turkish texts)
