LANGUAGE IDENTIFICATION BASED ON N-GRAM FEATURE EXTRACTION METHOD BY USING CLASSIFIERS

Volume: 13 Number: 2 December 25, 2013

Abstract

Expanding opportunities for communication have produced large numbers of documents in many different languages. Language identification plays a key role in making these documents understandable and in the study of natural language identification procedures. The growing number of documents and the demands of international communication make further work on language identification necessary. Many studies to date have addressed document-based language identification, using characters, words and n-gram sequences together with machine learning techniques. In this study, sequences of n-gram frequencies are used, and the accuracy of five different classification algorithms is analyzed on documents of different sizes belonging to 15 different languages. An n-gram based feature extraction method is used to build the feature vector for each language. The most appropriate method for language identification is determined by comparing the performance of Support Vector Machines, Multilayer Perceptron, Centroid Classifier, k-Means and Fuzzy C-Means. Training and testing data for the experiments are selected from the ECI multilingual corpus.
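
The idea of n-gram frequency features combined with a centroid classifier can be illustrated with a minimal sketch. The abstract does not state the n-gram length, the normalization, or the similarity measure, so character trigrams, relative frequencies and cosine similarity are assumptions made here for illustration; the function names and the two toy training sentences are hypothetical and are not taken from the ECI corpus.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_profile(text, n=3):
    """Map each n-gram to its relative frequency in the text."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def centroid(profiles):
    """Average several sparse profiles into one centroid vector."""
    acc = Counter()
    for profile in profiles:
        for gram, value in profile.items():
            acc[gram] += value
    return {gram: value / len(profiles) for gram, value in acc.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(gram, 0.0) for gram, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples, n=3):
    """Build one centroid per language from (language, text) pairs."""
    by_lang = {}
    for lang, text in samples:
        by_lang.setdefault(lang, []).append(ngram_profile(text, n))
    return {lang: centroid(profiles) for lang, profiles in by_lang.items()}


def identify(text, centroids, n=3):
    """Return the language whose centroid is most similar to the text."""
    profile = ngram_profile(text, n)
    return max(centroids, key=lambda lang: cosine(profile, centroids[lang]))


# Toy training data (hypothetical, for illustration only).
TOY_SAMPLES = [
    ("en", "the quick brown fox jumps over the lazy dog and the cat sleeps on the mat"),
    ("tr", "bir kedi masanın üzerinde uyuyor ve köpek bahçede koşuyor"),
]
model = train(TOY_SAMPLES)
print(identify("the dog sleeps on the mat", model))
```

The same profiles could serve as input to the other classifiers named in the abstract; the centroid variant is shown only because it needs no external library.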

Keywords

References

  1. T. Dunning, “Statistical identification of language”, Technical Report CRL Technical Memo MCCS-94-273, University of New Mexico, ESANN.99, Belgium, 1994.
  2. Y. Ueda, S. Nakagawa, “Prediction for phoneme/syllable/word-category and identification of language using HMM”, International Conference on Spoken Language Processing, volume 2, pages 1209–1212, 1990.
  3. C. Manning, H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, Boston, 1994.
  4. European Corpus Initiative Multilingual Corpus (ECI/MCI) (2005), http://www.elsnet.org/resources/eciCorps.html, page last modified 29-03-2005.
  5. G. Grefenstette, “Comparing two language identification schemes”, Proceedings of JADT 3rd International Conference on Statistical Analysis of Textual Data, 1995.
  6. W.B. Cavnar, J.M. Trenkle, “N-gram-based text categorization”, Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, US, pages 161–175, 1994.
  7. I. Suzuki, Y. Mikami, A. Ohsato, Y. Chubachi, “A language and character set determination method based on N-gram statistics”, ACM Transactions on Asian Language Information Processing (TALIP), Volume 1, Issue 3, pages 269–278, 2002.
  8. G. Adams, P. Resnik, “A language identification application built on the Java client-server platform”, in J. Burstein & C. Leacock (Eds.), From Research to Commercial Applications: Making NLP Work in Practice, Somerset, New Jersey: Association for Computational Linguistics, pages 43–47, 1997.

Details

Primary Language

English

Publication Date

December 25, 2013

Submission Date

December 25, 2013

Published in Issue

Year 2013 Volume: 13 Number: 2

APA
Bayrak Hayta, Ş., Hayta, Ş., Takçı, H., & Eminli, M. (2013). LANGUAGE IDENTIFICATION BASED ON N-GRAM FEATURE EXTRACTION METHOD BY USING CLASSIFIERS. IU-Journal of Electrical & Electronics Engineering, 13(2), 1629-1639. https://izlik.org/JA53YX37YN
AMA
1.Bayrak Hayta Ş, Hayta Ş, Takçı H, Eminli M. LANGUAGE IDENTIFICATION BASED ON N-GRAM FEATURE EXTRACTION METHOD BY USING CLASSIFIERS. IU-Journal of Electrical & Electronics Engineering. 2013;13(2):1629-1639. https://izlik.org/JA53YX37YN
Chicago
Bayrak Hayta, Şengül, Şengül Hayta, Hidayet Takçı, and Mübariz Eminli. 2013. “LANGUAGE IDENTIFICATION BASED ON N-GRAM FEATURE EXTRACTION METHOD BY USING CLASSIFIERS”. IU-Journal of Electrical & Electronics Engineering 13 (2): 1629-39. https://izlik.org/JA53YX37YN.
EndNote
Bayrak Hayta Ş, Hayta Ş, Takçı H, Eminli M (December 1, 2013) LANGUAGE IDENTIFICATION BASED ON N-GRAM FEATURE EXTRACTION METHOD BY USING CLASSIFIERS. IU-Journal of Electrical & Electronics Engineering 13 2 1629–1639.
IEEE
[1]Ş. Bayrak Hayta, Ş. Hayta, H. Takçı, and M. Eminli, “LANGUAGE IDENTIFICATION BASED ON N-GRAM FEATURE EXTRACTION METHOD BY USING CLASSIFIERS”, IU-Journal of Electrical & Electronics Engineering, vol. 13, no. 2, pp. 1629–1639, Dec. 2013, [Online]. Available: https://izlik.org/JA53YX37YN
ISNAD
Bayrak Hayta, Şengül - Hayta, Şengül - Takçı, Hidayet - Eminli, Mübariz. “LANGUAGE IDENTIFICATION BASED ON N-GRAM FEATURE EXTRACTION METHOD BY USING CLASSIFIERS”. IU-Journal of Electrical & Electronics Engineering 13/2 (December 1, 2013): 1629-1639. https://izlik.org/JA53YX37YN.
JAMA
1.Bayrak Hayta Ş, Hayta Ş, Takçı H, Eminli M. LANGUAGE IDENTIFICATION BASED ON N-GRAM FEATURE EXTRACTION METHOD BY USING CLASSIFIERS. IU-Journal of Electrical & Electronics Engineering. 2013;13:1629–1639.
MLA
Bayrak Hayta, Şengül, et al. “LANGUAGE IDENTIFICATION BASED ON N-GRAM FEATURE EXTRACTION METHOD BY USING CLASSIFIERS”. IU-Journal of Electrical & Electronics Engineering, vol. 13, no. 2, Dec. 2013, pp. 1629-39, https://izlik.org/JA53YX37YN.
Vancouver
1.Şengül Bayrak Hayta, Şengül Hayta, Hidayet Takçı, Mübariz Eminli. LANGUAGE IDENTIFICATION BASED ON N-GRAM FEATURE EXTRACTION METHOD BY USING CLASSIFIERS. IU-Journal of Electrical & Electronics Engineering [Internet]. 2013 Dec. 1;13(2):1629-39. Available from: https://izlik.org/JA53YX37YN