Research Article

A HYBRID STATISTICAL APPROACH TO STEMMING IN TURKISH: AN AGGLUTINATIVE LANGUAGE

Volume: 17 Number: 2 July 14, 2016

Abstract

Stemming is a complicated and important problem for agglutinative languages such as Turkish, where a theoretically infinite number of surface forms can be derived from a single lexeme. Both analytical and statistical approaches have been tried for stemming Turkish words. These approaches share two main problems: they rely on a dictionary, which enforces a closed-vocabulary assumption, and they must disambiguate the actual stem among numerous candidates. Here, we present a method that exploits the simple fact that nouns and verbs have different suffix patterns. Statistical methods are used to strip off the suffixes. The part of speech (PoS) is then determined from the suffix pattern, which in turn fixes the stem boundary. Thus, the major contributions of the study are that it avoids the disambiguation problem and does not use a regular dictionary for stemming. The performance rate of the proposed method on a gold-standard PoS-tagged Turkish corpus is 93.83%.
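
The idea in the abstract (strip suffixes, infer PoS from the suffix pattern, then place the stem boundary) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not specify the statistical model, so the suffix inventories `NOUN_SUFFIXES` and `VERB_SUFFIXES`, the greedy stripping in `strip_suffixes`, and the "most characters explained" rule in `guess_pos_and_stem` are all hypothetical stand-ins chosen only to show the flow without a dictionary lookup.

```python
# Toy illustration only. The suffix lists are small, hand-picked assumptions,
# not the statistically derived suffix patterns used in the paper.
NOUN_SUFFIXES = ["ler", "lar", "de", "da", "den", "dan", "in", "im"]
VERB_SUFFIXES = ["iyor", "ıyor", "uyor", "üyor", "di", "dı", "du", "dü", "um", "sun"]


def strip_suffixes(word, suffixes, min_stem=2):
    """Greedily peel known suffixes from the right end of `word`.

    Returns (remaining_stem, list_of_stripped_suffixes_in_word_order).
    `min_stem` (an assumed safeguard) stops stripping before the stem
    becomes shorter than two characters.
    """
    stripped = []
    changed = True
    while changed:
        changed = False
        # Try longer suffixes first so "den" wins over "in", and so on.
        for suf in sorted(suffixes, key=len, reverse=True):
            if word.endswith(suf) and len(word) - len(suf) >= min_stem:
                word = word[: -len(suf)]
                stripped.insert(0, suf)
                changed = True
                break
    return word, stripped


def guess_pos_and_stem(word):
    """Pick the PoS whose suffix pattern explains more of the word.

    The coverage comparison is an assumed heuristic standing in for the
    paper's statistical decision; ties default to NOUN.
    """
    noun_stem, noun_sufs = strip_suffixes(word, NOUN_SUFFIXES)
    verb_stem, verb_sufs = strip_suffixes(word, VERB_SUFFIXES)
    noun_cov = sum(len(s) for s in noun_sufs)
    verb_cov = sum(len(s) for s in verb_sufs)
    if verb_cov > noun_cov:
        return "VERB", verb_stem
    return "NOUN", noun_stem


if __name__ == "__main__":
    for w in ["evlerde", "geliyorum"]:
        print(w, guess_pos_and_stem(w))
```

In this sketch, `evlerde` ("in the houses") is split as `ev` + `ler` + `de` and tagged as a noun, while `geliyorum` ("I am coming") is split as `gel` + `iyor` + `um` and tagged as a verb. No candidate stem is ever looked up in a lexicon, which mirrors the dictionary-free motivation stated above, though a real system would need far richer suffix statistics and handling of vowel harmony and consonant alternations.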

Details

Primary Language

English

Subjects

Engineering

Journal Section

Research Article

Publication Date

July 14, 2016

Submission Date

January 7, 2016

Published in Issue

Year 2016 Volume: 17 Number: 2

APA
Kışla, T., & Karaoğlan, B. (2016). A HYBRID STATISTICAL APPROACH TO STEMMING IN TURKISH: AN AGGLUTINATIVE LANGUAGE. Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering, 17(2), 401-412. https://doi.org/10.18038/btda.31812