Research Article

Topic and Sub-Topic Detection Model in English Documents

Year 2018, Volume 6, Issue 4, 754 - 764, 01.08.2018
https://doi.org/10.29130/dubited.420104

Abstract

This article proposes a model for detecting topics and sub-topics in documents and evaluates the experimental findings. Gestalt theory, based on the Helmholtz principle, was used to identify the meaningful words in each document that can indicate its topic and sub-topics. These words were fed as input to an Artificial Neural Network (ANN), which was trained on 140 training documents. The training and test documents cover the topics of sports and education, with 14 sub-topics selected in total. The ANN outputs the topic and sub-topic of a document. Experiments were run on 70 test documents using different numbers of meaningful words (5, 10, 20). The observed success rate was approximately 95% for topics and 80% for sub-topics.
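As an illustration of the word-selection step, the sketch below ranks the words of one document by a Helmholtz-style meaning score. It is a minimal sketch and not the authors' implementation: it assumes the "number of false alarms" formulation commonly used in the Helmholtz-principle literature, NFA = C(K, m) / N^(m-1), where a word occurs K times in a corpus of N documents and m times in the document at hand, and it treats a word as meaningful when NFA < 1. The function name `meaningful_words`, the `top_k` parameter, and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def meaningful_words(docs, doc_index, top_k=5):
    """Return up to top_k words of docs[doc_index], ranked by meaning score.

    docs is a list of tokenized documents (lists of words).
    Assumption: NFA = C(K, m) / N**(m - 1); a word counts as meaningful
    when NFA < 1, i.e. when log(NFA) < 0.
    """
    n_docs = len(docs)
    corpus_counts = Counter(word for doc in docs for word in doc)
    doc_counts = Counter(docs[doc_index])

    scored = []
    for word, m in doc_counts.items():
        k = corpus_counts[word]
        # Work in log space to avoid overflow on large counts.
        log_nfa = math.log(math.comb(k, m)) - (m - 1) * math.log(n_docs)
        if log_nfa < 0:
            # Higher score means the word's local burst is less likely by chance.
            scored.append((-log_nfa / m, word))

    # Sort by descending score, breaking ties alphabetically.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [word for _, word in scored[:top_k]]


# Toy corpus (hypothetical data, for illustration only).
docs = [
    ["goal", "goal", "goal", "match", "the"],
    ["exam", "exam", "school", "the"],
    ["the", "tax", "tax", "budget"],
]
print(meaningful_words(docs, 0))
```

Under these assumptions, the selected words (for example the top 5, 10, or 20) would form the input features of a classifier such as the ANN described in the abstract; how the words are encoded for the network is not specified here.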


İngilizce Dokümanlarda Tema ve Alt Kavramlar Tespit Modeli


Abstract

In this article, a model for detecting topics and sub-topics in documents is proposed and the experimental findings are evaluated. Gestalt theory based on the Helmholtz principle was used to identify the meaningful words that can be used to detect topics and sub-topics in documents. An Artificial Neural Network (ANN) model taking these words as input was built, and the network was trained with the training documents (140 in number). The training and test document dataset covers the topics of sports and education, and a total of 14 sub-topics were selected. The output of the ANN gives the topic and sub-topic information. Experiments were carried out on 70 test documents by selecting different numbers (5, 10, 20) of meaningful words, and the success rate was observed to be approximately 95% for topics and 80% for sub-topics.


Details

Primary Language: Turkish
Subjects: Engineering
Journal Section: Articles
Authors:

Metin Turan (ORCID 0000-0002-1941-6693)

Sena Ögtelik

Publication Date: August 1, 2018
Published in Issue: Year 2018

Cite

APA Turan, M., & Ögtelik, S. (2018). İngilizce Dokümanlarda Tema ve Alt Kavramlar Tespit Modeli. Duzce University Journal of Science and Technology, 6(4), 754-764. https://doi.org/10.29130/dubited.420104
AMA Turan M, Ögtelik S. İngilizce Dokümanlarda Tema ve Alt Kavramlar Tespit Modeli. DÜBİTED. August 2018;6(4):754-764. doi:10.29130/dubited.420104
Chicago Turan, Metin, and Sena Ögtelik. “İngilizce Dokümanlarda Tema Ve Alt Kavramlar Tespit Modeli”. Duzce University Journal of Science and Technology 6, no. 4 (August 2018): 754-64. https://doi.org/10.29130/dubited.420104.
EndNote Turan M, Ögtelik S (August 1, 2018) İngilizce Dokümanlarda Tema ve Alt Kavramlar Tespit Modeli. Duzce University Journal of Science and Technology 6 4 754–764.
IEEE M. Turan and S. Ögtelik, “İngilizce Dokümanlarda Tema ve Alt Kavramlar Tespit Modeli”, DÜBİTED, vol. 6, no. 4, pp. 754–764, 2018, doi: 10.29130/dubited.420104.
ISNAD Turan, Metin - Ögtelik, Sena. “İngilizce Dokümanlarda Tema Ve Alt Kavramlar Tespit Modeli”. Duzce University Journal of Science and Technology 6/4 (August 2018), 754-764. https://doi.org/10.29130/dubited.420104.
JAMA Turan M, Ögtelik S. İngilizce Dokümanlarda Tema ve Alt Kavramlar Tespit Modeli. DÜBİTED. 2018;6:754–764.
MLA Turan, Metin and Sena Ögtelik. “İngilizce Dokümanlarda Tema Ve Alt Kavramlar Tespit Modeli”. Duzce University Journal of Science and Technology, vol. 6, no. 4, 2018, pp. 754-64, doi:10.29130/dubited.420104.
Vancouver Turan M, Ögtelik S. İngilizce Dokümanlarda Tema ve Alt Kavramlar Tespit Modeli. DÜBİTED. 2018;6(4):754-64.