Research Article
BibTex RIS Cite

SSC: Clustering Of Turkish Texts By Spectral Graph Partitioning

Year 2021, , 1433 - 1444, 01.12.2021
https://doi.org/10.2339/politeknik.684558

Abstract

There is growing interest in studies on text classification as a result of the exponential increase in the amount of data available. Many studies have been conducted in the field of text clustering, using different approaches. This study introduces Spectral Sentence Clustering (SSC) for text clustering problems, which is an unsupervised method based on graph-partitioning. The study explains how the proposed model proposed can be used in natural language applications to successfully cluster texts. A spectral graph theory method is used to partition the graph into non-intersecting sub-graphs, and an unsupervised and efficient solution is offered for the text clustering problem by providing a physical representation of the texts. Finally, tests have been conducted demonstrating that SSC can be successfully used for text categorization. A clustering success rate of 97.08% was achieved in tests conducted using the TTC-3600 dataset, which contains open-access unstructured Turkish texts, classified into categories. The SSC model proposed performed better compared to a popular k-means clustering algorithm.

References

  • [1] Ş. Canberk, G. , Sağıroğlu, Bilgi ve Bilgisayar Güvenliği : Casus Yazılımlar ve Korunma Yöntemleri. Ankara: Grafiker Yayıncılık, 2006.
  • [2] Osman Durmaz;, “Metin Sınıflandırmada Boyut Azaltmanın Etkisi ve Özellik Seçimi,” Gazi Üniversitesi, 2011.
  • [3] C. Hark, A. Seyyarer, T. Uçkan, and A. Karci, “Doǧal dil işleme yaklaşimlari ile yapisal olmayan dökümanlarin benzerliǧi,” in IDAP 2017 - International Artificial Intelligence and Data Processing Symposium, 2017.
  • [4] D. D. Lewis, “Naive (Bayes) at forty: The independence assumption in information retrieval,” in European conference on machine learning, 1998, pp. 4–15.
  • [5] K. Aas and L. Eikvil, “Text categorisation: A survey.” Technical report, Norwegian computing center, 1999.
  • [6] N. O. Andrews and E. A. Fox, “Recent Developments in Document Clustering,” 2007.
  • [7] N. Shah, “Document Clustering : A Detailed Review,” vol. 4, no. 5, pp. 30–38, 2012.
  • [8] A. Pujari, “Data mining techniques,” 2001.
  • [9] S. N. Bharath Bhushan and A. Danti, “Classification of text documents based on score level fusion approach,” Pattern Recognit. Lett., vol. 94, pp. 118–126, Jul. 2017.
  • [10] R. J. Wilson and J. J. Watkins, Graphs: an introductory approach: a first course in discrete mathematics. John Wiley & Sons Inc, 1990.
  • [11] H. Qiu and E. R. Hancock, “Graph matching and clustering using spectral partitions,” Pattern Recognit., vol. 39, pp. 22–34, 2006.
  • [12] D. Kılınç, A. Özçift, F. Bozyigit, P. Yıldırım, F. Yücalar, and E. Borandag, “TTC-3600: A new benchmark dataset for Turkish text categorization,” J. Inf. Sci., vol. 43, no. 2, pp. 174–185, Apr. 2017.
  • [13] C. Hark, A. Seyyarer, T. Uçkan, and A. Karci, “Doǧal dil işleme yaklaşimlari ile yapisal olmayan dökümanlarin benzerliǧi,” in IDAP 2017 - International Artificial Intelligence and Data Processing Symposium, 2017, pp. 1–6.
  • [14] C. Hark, T. Uçkan, and A. Seyyarer, Abubekir Karci, “Metin Özetleme İçin Çizge Tabanlı Bir Öneri,” in IDAP 2018 - International Artificial Intelligence and Data Processing Symposium, 2018.
  • [15] A. Karci, “Çizge Algoritmaları ve Çizge Bölmeleme.pdf,” 1998.
  • [16] U. Von Luxburg, “A Tutorial on Spectral Clustering,” 2007.
  • [17] B. Slininger, “Fiedler’s Theory of Spectral Graph Partitioning.”
  • [18] H. Jia, S. Ding, and M. Du, “Self-Tuning p-Spectral Clustering Based on Shared Nearest Neighbors,” Cognit. Comput.
  • [19] L. Macdonald, “Successive Approximations by the Rayleigh-Ritz Variation Methoti,” 1933.
  • [20] L. Hagen, … A. K. on computer-aided design of, and U. 1992, “New spectral methods for ratio cut partitioning and clustering,” ieeexplore.ieee.org.
  • [21] S. Jianbo and M. Jitendra, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000.
  • [22] A. Ben Ayed, M. Ben Halima, and A. M. Alimi, “Adaptive fuzzy exponent cluster ensemble system based feature selection and spectral clustering,” IEEE Int. Conf. Fuzzy Syst., no. July, 2017.
  • [23] M. Filippone, F. Camastra, F. Masulli, and S. Rovetta, “A survey of kernel and spectral methods for clustering,” Pattern Recognit., vol. 41, pp. 176–190, 2008.
  • [24] W. R. Casper and B. Nadiga, “A New Spectral Clustering Algorithm.”
  • [25] R. M. Alguliev, R. M. Aliguliyev, and M. S. Hajirahimova, “GenDocSum+ MCLR: Generic document summarization based on maximum coverage and less redundancy,” Expert Syst. Appl., vol. 39, no. 16, pp. 12460–12473, 2012.
  • [26] M. Fiedler, “Algebraic connectivity of graphs,” Czechoslov. Math. J., vol. 23, no. 2, pp. 298–305, 1973.
  • [27] F. R. K. Chung, “Lectures on Spectral Graph Theory.”
  • [28] C. T. Zheng, C. Liu, and H. S. Wong, “Corpus-based topic diffusion for short text clustering,” Neurocomputing, vol. 275, pp. 2444–2458, 2018.
  • [29] J. Xu et al., “Self-Taught convolutional neural networks for short text clustering,” Neural Networks, vol. 88, pp. 22–31, 2017.
  • [30] F. Beil, M. Ester, B. Bc, and C. Va, “Frequent Term-Based Text Clustering,” 2002.
  • [31] C. Manning, P. Raghavan, and H. Schütze, “Introduction to information retrieval,” Nat. Lang. Eng., vol. 16, no. 1, pp. 100–103, 2010.
  • [32] M. Kozlowski and H. Rybinski, “Clustering of semantically enriched short texts,” J. Intell. Inf. Syst., Dec. 2018.

SSC: Clustering Of Turkish Texts By Spectral Graph Partitioning

Year 2021, , 1433 - 1444, 01.12.2021
https://doi.org/10.2339/politeknik.684558

Abstract

There is growing interest in studies on text classification as a result of the exponential increase in the amount of data available. Many studies have been conducted in the field of text clustering, using different approaches. This study introduces Spectral Sentence Clustering (SSC) for text clustering problems, which is an unsupervised method based on graph-partitioning. The study explains how the proposed model proposed can be used in natural language applications to successfully cluster texts. A spectral graph theory method is used to partition the graph into non-intersecting sub-graphs, and an unsupervised and efficient solution is offered for the text clustering problem by providing a physical representation of the texts. Finally, tests have been conducted demonstrating that SSC can be successfully used for text categorization. A clustering success rate of 97.08% was achieved in tests conducted using the TTC-3600 dataset, which contains open-access unstructured Turkish texts, classified into categories. The SSC model proposed performed better compared to a popular k-means clustering algorithm.

References

  • [1] Ş. Canberk, G. , Sağıroğlu, Bilgi ve Bilgisayar Güvenliği : Casus Yazılımlar ve Korunma Yöntemleri. Ankara: Grafiker Yayıncılık, 2006.
  • [2] Osman Durmaz;, “Metin Sınıflandırmada Boyut Azaltmanın Etkisi ve Özellik Seçimi,” Gazi Üniversitesi, 2011.
  • [3] C. Hark, A. Seyyarer, T. Uçkan, and A. Karci, “Doǧal dil işleme yaklaşimlari ile yapisal olmayan dökümanlarin benzerliǧi,” in IDAP 2017 - International Artificial Intelligence and Data Processing Symposium, 2017.
  • [4] D. D. Lewis, “Naive (Bayes) at forty: The independence assumption in information retrieval,” in European conference on machine learning, 1998, pp. 4–15.
  • [5] K. Aas and L. Eikvil, “Text categorisation: A survey.” Technical report, Norwegian computing center, 1999.
  • [6] N. O. Andrews and E. A. Fox, “Recent Developments in Document Clustering,” 2007.
  • [7] N. Shah, “Document Clustering : A Detailed Review,” vol. 4, no. 5, pp. 30–38, 2012.
  • [8] A. Pujari, “Data mining techniques,” 2001.
  • [9] S. N. Bharath Bhushan and A. Danti, “Classification of text documents based on score level fusion approach,” Pattern Recognit. Lett., vol. 94, pp. 118–126, Jul. 2017.
  • [10] R. J. Wilson and J. J. Watkins, Graphs: an introductory approach: a first course in discrete mathematics. John Wiley & Sons Inc, 1990.
  • [11] H. Qiu and E. R. Hancock, “Graph matching and clustering using spectral partitions,” Pattern Recognit., vol. 39, pp. 22–34, 2006.
  • [12] D. Kılınç, A. Özçift, F. Bozyigit, P. Yıldırım, F. Yücalar, and E. Borandag, “TTC-3600: A new benchmark dataset for Turkish text categorization,” J. Inf. Sci., vol. 43, no. 2, pp. 174–185, Apr. 2017.
  • [13] C. Hark, A. Seyyarer, T. Uçkan, and A. Karci, “Doǧal dil işleme yaklaşimlari ile yapisal olmayan dökümanlarin benzerliǧi,” in IDAP 2017 - International Artificial Intelligence and Data Processing Symposium, 2017, pp. 1–6.
  • [14] C. Hark, T. Uçkan, and A. Seyyarer, Abubekir Karci, “Metin Özetleme İçin Çizge Tabanlı Bir Öneri,” in IDAP 2018 - International Artificial Intelligence and Data Processing Symposium, 2018.
  • [15] A. Karci, “Çizge Algoritmaları ve Çizge Bölmeleme.pdf,” 1998.
  • [16] U. Von Luxburg, “A Tutorial on Spectral Clustering,” 2007.
  • [17] B. Slininger, “Fiedler’s Theory of Spectral Graph Partitioning.”
  • [18] H. Jia, S. Ding, and M. Du, “Self-Tuning p-Spectral Clustering Based on Shared Nearest Neighbors,” Cognit. Comput.
  • [19] L. Macdonald, “Successive Approximations by the Rayleigh-Ritz Variation Methoti,” 1933.
  • [20] L. Hagen, … A. K. on computer-aided design of, and U. 1992, “New spectral methods for ratio cut partitioning and clustering,” ieeexplore.ieee.org.
  • [21] S. Jianbo and M. Jitendra, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000.
  • [22] A. Ben Ayed, M. Ben Halima, and A. M. Alimi, “Adaptive fuzzy exponent cluster ensemble system based feature selection and spectral clustering,” IEEE Int. Conf. Fuzzy Syst., no. July, 2017.
  • [23] M. Filippone, F. Camastra, F. Masulli, and S. Rovetta, “A survey of kernel and spectral methods for clustering,” Pattern Recognit., vol. 41, pp. 176–190, 2008.
  • [24] W. R. Casper and B. Nadiga, “A New Spectral Clustering Algorithm.”
  • [25] R. M. Alguliev, R. M. Aliguliyev, and M. S. Hajirahimova, “GenDocSum+ MCLR: Generic document summarization based on maximum coverage and less redundancy,” Expert Syst. Appl., vol. 39, no. 16, pp. 12460–12473, 2012.
  • [26] M. Fiedler, “Algebraic connectivity of graphs,” Czechoslov. Math. J., vol. 23, no. 2, pp. 298–305, 1973.
  • [27] F. R. K. Chung, “Lectures on Spectral Graph Theory.”
  • [28] C. T. Zheng, C. Liu, and H. S. Wong, “Corpus-based topic diffusion for short text clustering,” Neurocomputing, vol. 275, pp. 2444–2458, 2018.
  • [29] J. Xu et al., “Self-Taught convolutional neural networks for short text clustering,” Neural Networks, vol. 88, pp. 22–31, 2017.
  • [30] F. Beil, M. Ester, B. Bc, and C. Va, “Frequent Term-Based Text Clustering,” 2002.
  • [31] C. Manning, P. Raghavan, and H. Schütze, “Introduction to information retrieval,” Nat. Lang. Eng., vol. 16, no. 1, pp. 100–103, 2010.
  • [32] M. Kozlowski and H. Rybinski, “Clustering of semantically enriched short texts,” J. Intell. Inf. Syst., Dec. 2018.
There are 32 citations in total.

Details

Primary Language English
Subjects Engineering
Journal Section Research Article
Authors

Taner Uçkan 0000-0001-5385-6775

Cengiz Hark 0000-0002-5190-3504

Ali Karci 0000-0002-8489-8617

Publication Date December 1, 2021
Submission Date February 5, 2020
Published in Issue Year 2021

Cite

APA Uçkan, T., Hark, C., & Karci, A. (2021). SSC: Clustering Of Turkish Texts By Spectral Graph Partitioning. Politeknik Dergisi, 24(4), 1433-1444. https://doi.org/10.2339/politeknik.684558
AMA Uçkan T, Hark C, Karci A. SSC: Clustering Of Turkish Texts By Spectral Graph Partitioning. Politeknik Dergisi. December 2021;24(4):1433-1444. doi:10.2339/politeknik.684558
Chicago Uçkan, Taner, Cengiz Hark, and Ali Karci. “SSC: Clustering Of Turkish Texts By Spectral Graph Partitioning”. Politeknik Dergisi 24, no. 4 (December 2021): 1433-44. https://doi.org/10.2339/politeknik.684558.
EndNote Uçkan T, Hark C, Karci A (December 1, 2021) SSC: Clustering Of Turkish Texts By Spectral Graph Partitioning. Politeknik Dergisi 24 4 1433–1444.
IEEE T. Uçkan, C. Hark, and A. Karci, “SSC: Clustering Of Turkish Texts By Spectral Graph Partitioning”, Politeknik Dergisi, vol. 24, no. 4, pp. 1433–1444, 2021, doi: 10.2339/politeknik.684558.
ISNAD Uçkan, Taner et al. “SSC: Clustering Of Turkish Texts By Spectral Graph Partitioning”. Politeknik Dergisi 24/4 (December 2021), 1433-1444. https://doi.org/10.2339/politeknik.684558.
JAMA Uçkan T, Hark C, Karci A. SSC: Clustering Of Turkish Texts By Spectral Graph Partitioning. Politeknik Dergisi. 2021;24:1433–1444.
MLA Uçkan, Taner et al. “SSC: Clustering Of Turkish Texts By Spectral Graph Partitioning”. Politeknik Dergisi, vol. 24, no. 4, 2021, pp. 1433-44, doi:10.2339/politeknik.684558.
Vancouver Uçkan T, Hark C, Karci A. SSC: Clustering Of Turkish Texts By Spectral Graph Partitioning. Politeknik Dergisi. 2021;24(4):1433-44.
 
TARANDIĞIMIZ DİZİNLER (ABSTRACTING / INDEXING)
181341319013191 13189 13187 13188 18016 

download Bu eser Creative Commons Atıf-AynıLisanslaPaylaş 4.0 Uluslararası ile lisanslanmıştır.