News categorization, a common application area of text classification, is the task of automatically annotating news articles with predefined categories. With the rise of deep learning techniques in machine learning, neural embedding models have been widely used to capture hidden relationships and similarities among textual representations of news articles. In this study, we approach the Turkish news categorization problem as an ad-hoc retrieval task and investigate the effectiveness of paragraph vector models for computing and utilizing document-wise similarities of Turkish news articles. We propose an ensemble categorization approach that consists of three main stages: document processing, paragraph vector learning, and document similarity estimation. Extensive experiments conducted on the TTC-3600 dataset reveal that the proposed system can reach up to 93.5% classification accuracy, which compares favorably with the baseline and state-of-the-art methods. The results also show that the Distributed Bag of Words version of Paragraph Vectors performs better than the Distributed Memory Model of Paragraph Vectors in terms of both accuracy and computational performance.
Turkish news categorization, Text classification, Neural embeddings, Paragraph vectors, Document similarity
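The retrieval-style categorization described in the abstract can be illustrated with a minimal sketch. It assumes that document vectors have already been learned by a paragraph vector model, and it uses cosine similarity with a majority vote over the `k` most similar labeled documents. The similarity measure, the voting rule, the function names `cosine` and `categorize`, and the toy vectors and labels are all assumptions made for illustration; the abstract does not specify how the ensemble or the similarity estimation is actually implemented.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def categorize(query_vec, labeled_vecs, k=3):
    """Rank labeled documents by similarity to the query and vote among the top k.

    labeled_vecs is a list of (vector, category) pairs. This is a hypothetical
    helper, not the method from the paper.
    """
    ranked = sorted(
        labeled_vecs,
        key=lambda item: cosine(query_vec, item[0]),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy 3-dimensional "document vectors"; real paragraph vectors would come
# from a trained model and have many more dimensions.
train = [
    ([0.9, 0.1, 0.0], "sports"),
    ([0.8, 0.2, 0.1], "sports"),
    ([0.1, 0.9, 0.2], "economy"),
    ([0.0, 0.8, 0.3], "economy"),
    ([0.1, 0.2, 0.9], "health"),
]

predicted = categorize([0.85, 0.15, 0.05], train, k=3)
print(predicted)
```

In this sketch the unlabeled article plays the role of the query and the labeled collection plays the role of the retrieval corpus, which is one way to read "ad-hoc retrieval task" in the abstract.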
The author would like to thank the editor and anonymous reviewers.
Primary Language | English |
---|---|
Subjects | Engineering |
Journal Section | Articles |
Authors | |
Publication Date | March 29, 2023 |
Published in Issue | Year 2023 Volume: 24 Issue: 1 |