Research Article
Year 2016, Volume: 17 Issue: 2, 401 - 412, 14.07.2016
https://doi.org/10.18038/btda.31812

A HYBRID STATISTICAL APPROACH TO STEMMING IN TURKISH: AN AGGLUTINATIVE LANGUAGE

Abstract

Stemming is a complicated and important problem for agglutinative languages such as Turkish, where a theoretically infinite number of surface forms can be derived from a single lexeme. Both analytical and statistical approaches have been tried for stemming Turkish words. Two main problems with these approaches are their reliance on a dictionary, which enforces a closed-vocabulary assumption, and the need to disambiguate the actual stem among numerous candidates. Here, we present a method that exploits the simple fact that nouns and verbs have different suffix patterns. Statistical methods are used to strip off the suffixes. The part of speech (PoS) is determined from the suffix pattern, which in turn enables the decision on the stem boundary. Thus, the major contribution of the study is that it avoids the disambiguation problem and does not use a regular dictionary for stemming. The proposed method achieves a performance rate of 93.83% on a gold-standard PoS-tagged Turkish corpus.
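The idea of deciding the part of speech from the suffix pattern and then fixing the stem boundary can be illustrated with a toy sketch. It is not the statistical model of the paper: the suffix inventories, the function names `strip_suffixes` and `guess_pos_and_stem`, the minimum stem length, and the "longest stripped material wins" rule are all assumptions made only for illustration.

```python
# Toy illustration only. The suffix inventories are tiny, hand-picked
# assumptions; the paper derives its suffix patterns statistically.
NOUN_SUFFIXES = ["ler", "lar", "den", "dan", "in", "ın", "de", "da", "i", "ı"]
VERB_SUFFIXES = ["iyor", "ıyor", "ecek", "acak", "di", "dı", "um", "im", "ım", "m"]


def strip_suffixes(word, suffixes, min_stem=2):
    """Repeatedly strip the longest matching suffix from the end of the word.

    Returns the remaining stem and the stripped suffixes in surface order.
    """
    stripped = []
    ordered = sorted(suffixes, key=len, reverse=True)
    while True:
        for suffix in ordered:
            # Keep at least `min_stem` characters so the stem is never emptied.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                stripped.append(suffix)
                break
        else:
            # No suffix matched in this pass: the stem boundary is reached.
            return word, stripped[::-1]


def guess_pos_and_stem(word):
    """Pick the PoS whose suffix inventory explains more of the word ending."""
    noun_stem, noun_sfx = strip_suffixes(word, NOUN_SUFFIXES)
    verb_stem, verb_sfx = strip_suffixes(word, VERB_SUFFIXES)
    noun_len = sum(len(s) for s in noun_sfx)
    verb_len = sum(len(s) for s in verb_sfx)
    if verb_len > noun_len:
        return "Verb", verb_stem
    return "Noun", noun_stem  # ties default to Noun (an arbitrary choice)


print(guess_pos_and_stem("evlerden"))   # ev + ler + den
print(guess_pos_and_stem("geliyorum"))  # gel + iyor + um
```

No dictionary lookup is involved in this sketch: the stem is simply whatever remains once the chosen suffix pattern has been removed, which mirrors the dictionary-free design described in the abstract.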


Details

Primary Language: English
Subjects: Engineering
Journal Section: Articles
Authors: Tarık Kışla, Bahar Karaoğlan
Publication Date: July 14, 2016
Published in Issue: Year 2016, Volume: 17, Issue: 2

Cite

APA Kışla, T., & Karaoğlan, B. (2016). A HYBRID STATISTICAL APPROACH TO STEMMING IN TURKISH: AN AGGLUTINATIVE LANGUAGE. Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering, 17(2), 401-412. https://doi.org/10.18038/btda.31812
AMA Kışla T, Karaoğlan B. A HYBRID STATISTICAL APPROACH TO STEMMING IN TURKISH: AN AGGLUTINATIVE LANGUAGE. AUJST-A. August 2016;17(2):401-412. doi:10.18038/btda.31812
Chicago Kışla, Tarık, and Bahar Karaoğlan. “A HYBRID STATISTICAL APPROACH TO STEMMING IN TURKISH: AN AGGLUTINATIVE LANGUAGE”. Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering 17, no. 2 (August 2016): 401-12. https://doi.org/10.18038/btda.31812.
EndNote Kışla T, Karaoğlan B (August 1, 2016) A HYBRID STATISTICAL APPROACH TO STEMMING IN TURKISH: AN AGGLUTINATIVE LANGUAGE. Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering 17 2 401–412.
IEEE T. Kışla and B. Karaoğlan, “A HYBRID STATISTICAL APPROACH TO STEMMING IN TURKISH: AN AGGLUTINATIVE LANGUAGE”, AUJST-A, vol. 17, no. 2, pp. 401–412, 2016, doi: 10.18038/btda.31812.
ISNAD Kışla, Tarık - Karaoğlan, Bahar. “A HYBRID STATISTICAL APPROACH TO STEMMING IN TURKISH: AN AGGLUTINATIVE LANGUAGE”. Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering 17/2 (August 2016), 401-412. https://doi.org/10.18038/btda.31812.
JAMA Kışla T, Karaoğlan B. A HYBRID STATISTICAL APPROACH TO STEMMING IN TURKISH: AN AGGLUTINATIVE LANGUAGE. AUJST-A. 2016;17:401–412.
MLA Kışla, Tarık and Bahar Karaoğlan. “A HYBRID STATISTICAL APPROACH TO STEMMING IN TURKISH: AN AGGLUTINATIVE LANGUAGE”. Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering, vol. 17, no. 2, 2016, pp. 401-12, doi:10.18038/btda.31812.
Vancouver Kışla T, Karaoğlan B. A HYBRID STATISTICAL APPROACH TO STEMMING IN TURKISH: AN AGGLUTINATIVE LANGUAGE. AUJST-A. 2016;17(2):401-12.