An Efficient Document Categorization Approach for Turkish Based Texts

Volume: 3, Issue: 1, 13 January 2015

Abstract

Because the rapid, uncontrolled growth of textual data makes it infeasible to classify all documents by hand, automatic methods are used to organize the data. In this study, a support vector machine (SVM) classifier is used for text categorization. In text categorization applications, the text representation step can take a great deal of computation time because a very large number of terms must be weighted. The solution used in the literature so far has been to rely on lexicons that contain fewer terms. However, it has been observed that such solutions reduce the accuracy of text classification. In this paper, the term-document matrix is constructed in a user-dependent way, according to the purpose of the classification. Since the number of terms is still relatively large, a hash table is used to search for terms efficiently. In this way, an efficient and rapid TF-IDF method is introduced to construct a weight matrix that represents the term-document relations, and a study is conducted on classifying Turkish news documents and Turkish columnists. With the proposed approach, the computation time required for the term-weighting process is reduced substantially; in addition, 99% accuracy is achieved in determining news categories and 98% accuracy in identifying columnists.
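
The abstract does not give the exact weighting formula or data structures, so the following is only a minimal sketch of the general idea: a hash table (here a Python `dict`) maps each term to a column of the term-document matrix, so that each term lookup during weighting takes constant time on average rather than a scan of the vocabulary. The function names `build_term_index` and `tf_idf_matrix`, the toy corpus, and the textbook weighting `tf * log(N / df)` are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def build_term_index(documents):
    """Map each distinct term to a column index using a dict (hash table)."""
    index = {}
    for tokens in documents:
        for term in tokens:
            if term not in index:
                index[term] = len(index)
    return index


def tf_idf_matrix(documents, index):
    """Build a dense document-by-term weight matrix with tf * log(N / df)."""
    n_docs = len(documents)

    # Document frequency: in how many documents each term occurs.
    df = [0] * len(index)
    for tokens in documents:
        for term in set(tokens):
            df[index[term]] += 1

    matrix = []
    for tokens in documents:
        row = [0.0] * len(index)
        for term, count in Counter(tokens).items():
            col = index[term]               # hash lookup, O(1) on average
            tf = count / len(tokens)        # relative term frequency
            idf = math.log(n_docs / df[col])
            row[col] = tf * idf
        matrix.append(row)
    return matrix


# Toy, already tokenized corpus (hypothetical data).
docs = [
    ["borsa", "faiz", "borsa"],
    ["lig", "gol"],
    ["borsa", "gol"],
]
term_index = build_term_index(docs)
weights = tf_idf_matrix(docs, term_index)
```

Each row of `weights` could then be passed as a feature vector to an SVM classifier. A real system would likely store the rows sparsely, since most entries are zero.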

Details

Primary Language

English

Subjects

-

Section

-

Publication Date

13 January 2015

Submission Date

14 October 2014

Acceptance Date

-

Published in Issue

Year 2015, Volume: 3, Issue: 1

Cite

APA
İlhan Omurca, S., Baş, S., & Ekinci, E. (2015). An Efficient Document Categorization Approach for Turkish Based Texts. International Journal of Intelligent Systems and Applications in Engineering, 3(1), 7-13. https://doi.org/10.18201/ijisae.94177
AMA
1. İlhan Omurca S, Baş S, Ekinci E. An Efficient Document Categorization Approach for Turkish Based Texts. International Journal of Intelligent Systems and Applications in Engineering. 2015;3(1):7-13. doi:10.18201/ijisae.94177
Chicago
İlhan Omurca, Sevinç, Semih Baş, and Ekin Ekinci. 2015. “An Efficient Document Categorization Approach for Turkish Based Texts”. International Journal of Intelligent Systems and Applications in Engineering 3 (1): 7-13. https://doi.org/10.18201/ijisae.94177.
EndNote
İlhan Omurca S, Baş S, Ekinci E (01 January 2015) An Efficient Document Categorization Approach for Turkish Based Texts. International Journal of Intelligent Systems and Applications in Engineering 3 1 7–13.
IEEE
[1] S. İlhan Omurca, S. Baş, and E. Ekinci, “An Efficient Document Categorization Approach for Turkish Based Texts”, International Journal of Intelligent Systems and Applications in Engineering, vol. 3, no. 1, pp. 7–13, Jan. 2015, doi: 10.18201/ijisae.94177.
ISNAD
İlhan Omurca, Sevinç - Baş, Semih - Ekinci, Ekin. “An Efficient Document Categorization Approach for Turkish Based Texts”. International Journal of Intelligent Systems and Applications in Engineering 3/1 (01 January 2015): 7-13. https://doi.org/10.18201/ijisae.94177.
JAMA
1. İlhan Omurca S, Baş S, Ekinci E. An Efficient Document Categorization Approach for Turkish Based Texts. International Journal of Intelligent Systems and Applications in Engineering. 2015;3:7–13.
MLA
İlhan Omurca, Sevinç, et al. “An Efficient Document Categorization Approach for Turkish Based Texts”. International Journal of Intelligent Systems and Applications in Engineering, vol. 3, no. 1, January 2015, pp. 7-13, doi:10.18201/ijisae.94177.
Vancouver
1. Sevinç İlhan Omurca, Semih Baş, Ekin Ekinci. An Efficient Document Categorization Approach for Turkish Based Texts. International Journal of Intelligent Systems and Applications in Engineering. 01 January 2015;3(1):7-13. doi:10.18201/ijisae.94177