LANGUAGE IDENTIFICATION BASED ON N-GRAM FEATURE EXTRACTION METHOD BY USING CLASSIFIERS
Year 2013,
Volume: 13 Issue: 2, 1629 - 1639, 25.12.2013
Şengül Bayrak Hayta, Şengül Hayta, Hidayet Takçı, Mübariz Eminli
Abstract
Growing opportunities for communication have produced large numbers of documents in many different languages. Language identification plays a key role in making these documents understandable and in studying natural language identification procedures. The increasing number of documents and the demands of international communication make new work on language identification necessary. Many studies to date have addressed document-based language identification, using characters, words, and n-gram sequences together with machine learning techniques. In this study, sequences of n-gram frequencies are used as features, and the accuracy of five classification algorithms is analyzed on documents of different sizes belonging to 15 different languages. An n-gram-based feature extraction method is used to build the feature vector for each language. The most appropriate method for the language identification problem is determined by comparing the performance of Support Vector Machines, Multilayer Perceptron, Centroid Classifier, k-Means, and Fuzzy C-Means. Training and test data for the experiments are selected from the ECI multilingual corpus.
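The pipeline outlined in the abstract (n-gram frequency features fed to a classifier) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the n-gram order, the weighting, or the similarity measure, so character bigrams, relative frequencies, and cosine similarity are assumptions here. The function names and the toy sentences (which stand in for the ECI corpus) are hypothetical. Of the five classifiers compared in the study, only a centroid classifier is shown.

```python
from collections import Counter
import math

def ngram_profile(text, n=2):
    """Relative frequencies of character n-grams (n=2 is an assumption)."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}

def centroid(profiles):
    """Average several n-gram profiles into one class centroid."""
    summed = Counter()
    for profile in profiles:
        summed.update(profile)  # adds the frequency values key by key
    return {g: v / len(profiles) for g, v in summed.items()}

def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(v * b.get(g, 0.0) for g, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(samples, n=2):
    """Build one centroid per language from (language, text) pairs."""
    by_lang = {}
    for lang, text in samples:
        by_lang.setdefault(lang, []).append(ngram_profile(text, n))
    return {lang: centroid(profiles) for lang, profiles in by_lang.items()}

def identify(text, centroids, n=2):
    """Return the language whose centroid is most similar to the text."""
    profile = ngram_profile(text, n)
    return max(centroids, key=lambda lang: cosine(profile, centroids[lang]))

# Hypothetical toy training data; the study itself uses the ECI corpus.
TOY_SAMPLES = [
    ("en", "the quick brown fox jumps over the lazy dog"),
    ("en", "language identification is the first step in many text pipelines"),
    ("tr", "hızlı kahverengi tilki tembel köpeğin üzerinden atlar"),
    ("tr", "bu çalışmada dil tanıma için n-gram özellikleri kullanılmıştır"),
    ("de", "der schnelle braune fuchs springt über den faulen hund"),
    ("de", "die spracherkennung ist der erste schritt in vielen systemen"),
]

centroids = train(TOY_SAMPLES)
print(identify("the dog jumps over the fox", centroids))
```

With so little training text the prediction is only indicative. A realistic experiment would need far larger samples per language and a held-out test set, as in the study's comparison across document sizes.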
Şengül Bayrak was born in 1986. She graduated from Halic University, Computer Engineering Department, 200. She received her master's degree from Halic University in 2011. She has worked as a Research Assistant at Halic University since 2009. Her research interests are data mining, language identification, database management, and data security.

Hidayet Takçı was born in 1974. He received his bachelor's degree in Computer Engineering from Trakya University, 200. He received his master's and PhD degrees in Computer Engineering from Gebze Institute of Technology. He worked as a lecturer in the Department of Computer Engineering at Gebze Institute of Technology from 2002 to 2011. Since 2011 he has worked in the Department of Computer Engineering at Cumhuriyet University as Vice President and Head of the Software Engineering Department. His research interests are data mining and text mining, neural networks and applications, language identification, author identification, criminal data mining, and information crime.

Mübariz Eminli was born in 1948. He received his bachelor's degree from Azerbaycan Technical University, Automation and Computer Technologies Faculty, in 1971. He received his master's and PhD degrees from the Informatics and Computer Technologies Department of Azerbaycan Technical University. He has worked as a lecturer at the Halic University Department of Computer Engineering since 200. His research interests are fuzzy clustering, fuzzy modelling, artificial neural networks, and data mining.