Identification of OOV words in Turkish texts
Abstract
In this study, we present a semantic graph network
model capable of detecting out-of-vocabulary (OOV) words in Turkish
texts. In natural language processing (NLP), morphological analyzers can
encounter unknown words (UW) during word processing. This mostly occurs when
such tools depend on a dictionary to find probable lemmas before
parsing can proceed.
Sometimes an analyzer cannot find any candidates because the lemma is
absent from the dictionary, which degrades the parsing output.
The proposed model identifies OOV words that are suitable for inclusion in
dictionaries. In addition, co-occurrence relations among the lemmas in texts
are modelled as a semantic sub-graph, which is used to discover collocations
that can be proposed as new lemma candidates.
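The two ideas in the abstract, dictionary-based OOV detection and a co-occurrence graph used to surface collocations, can be illustrated with a minimal sketch. This is not the authors' model: the function names (`find_oov`, `cooccurrence_graph`, `collocation_candidates`), the toy lexicon, the adjacency-only notion of co-occurrence, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter

def find_oov(tokens, lexicon):
    """Return tokens missing from the lexicon.

    Assumption: a plain set lookup stands in for a morphological
    analyzer failing to find any lemma candidate.
    """
    return [t for t in tokens if t not in lexicon]

def cooccurrence_graph(sentences):
    """Count directed edges between adjacent lemmas in each sentence.

    Assumption: co-occurrence means immediate adjacency; the paper's
    semantic sub-graph may use a different relation.
    """
    edges = Counter()
    for sent in sentences:
        for left, right in zip(sent, sent[1:]):
            edges[(left, right)] += 1
    return edges

def collocation_candidates(edges, min_count=2):
    """Return lemma pairs seen at least min_count times (hypothetical threshold)."""
    return sorted(pair for pair, n in edges.items() if n >= min_count)

# Toy data, invented for illustration only.
lexicon = {"hafta", "son", "tatil", "plan"}
sentences = [
    ["hafta", "son", "tatil"],
    ["hafta", "son", "plan"],
    ["son", "tatil"],
]

oov = find_oov(["hafta", "son", "selfie"], lexicon)
candidates = collocation_candidates(cooccurrence_graph(sentences))
print(oov)
print(candidates)
```

In this toy setup, frequently adjacent lemma pairs are treated as collocation candidates that could be reviewed for inclusion in the dictionary; a real system would likely need association measures and morphological context rather than raw counts.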
Keywords
Details
Primary Language
English
Subjects
Engineering
Section
Research Article
Publication Date
October 31, 2019
Submission Date
December 30, 2018
Acceptance Date
September 26, 2019
Published in Issue
Year 2019 Volume: 8 Issue: 2