Year 2019, Volume 27, Issue 2, Pages 1028 - 1040 2019-04-01

Selective word encoding for effective text representation

SAVAŞ ÖZKAN, AKIN ÖZKAN

Determining the category of a text document from its semantic content is a well-motivated problem that has been studied extensively across applications. A compact representation of the text is a fundamental step toward precise results in these applications, and much work has concentrated on improving it. In particular, methods that aggregate word-level representations are the mainstream approach to the problem. In this paper, we tackle text representation to achieve high performance in different text classification tasks, and we present three contributions. First, to encode the word-level representations of each text, we adapt a trainable orderless aggregation algorithm that transforms word vectors into a more discriminative text-level representation. Second, we propose a term-weighting scheme that learns the relative importance of words from their context, with respect to the task, in an end-to-end manner. Third, we present a weighted loss function to mitigate the class imbalance between categories. To evaluate performance, we collect two datasets from publicly available web sources: Turkish parliament records (i.e. written speeches of four major political parties, with 30731/7683 train/test documents) and newspaper articles (i.e. daily articles of columnists, with 16000/3200 train/test documents). In the experiments, the proposed method yields significant improvements over the baseline techniques (i.e. VLAD and Fisher Vector) and achieves prediction accuracies of 0.823 and 0.878 for party membership and article category estimation, respectively. These results indicate that the proposed contributions (i.e. the trainable word-encoding model, the trainable term-weighting scheme and the weighted loss function) significantly outperform the baselines.
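
The abstract does not give formulas for the three components, so the following is only a minimal NumPy sketch of the general ideas under stated assumptions: a VLAD-style soft-assignment aggregation that ignores word order, per-word importance weights applied before aggregation, and a cross-entropy loss with inverse-frequency class weights. The function names, the soft-assignment rule, and the inverse-frequency weighting are illustrative assumptions and are not taken from the paper; in a trained model the centers and term weights would be learned rather than passed in.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def weighted_vlad(words, centers, term_weights):
    """Aggregate word vectors (n, d) into one (k*d,) text vector.

    centers: (k, d) cluster centers (assumed given; trainable in a learned variant).
    term_weights: (n,) per-word importance weights (hypothetical inputs here).
    """
    # Residual of every word to every center: shape (n, k, d).
    diffs = words[:, None, :] - centers[None, :, :]
    # Soft-assign each word to the centers by negative squared distance.
    assign = softmax(-np.sum(diffs ** 2, axis=2), axis=1)
    # Sum weighted residuals per center; summing over words discards order.
    agg = np.einsum("n,nk,nkd->kd", term_weights, assign, diffs)
    flat = agg.reshape(-1)
    norm = np.linalg.norm(flat)
    return flat / norm if norm > 0 else flat

def class_weights_from_counts(counts):
    # Inverse-frequency weights (an assumed scheme): rarer classes weigh more.
    counts = np.asarray(counts, dtype=float)
    return counts.sum() / (len(counts) * counts)

def weighted_cross_entropy(logits, labels, class_weights):
    # Cross-entropy where each sample is scaled by the weight of its true class.
    probs = softmax(logits, axis=1)
    picked = probs[np.arange(len(labels)), labels]
    w = class_weights[labels]
    return float(np.sum(-w * np.log(picked)) / np.sum(w))
```

Because the aggregation sums over words, permuting the words (together with their weights) leaves the text vector unchanged, which is what "orderless" refers to in this sketch.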
Keywords: Text representation, orderless feature aggregation, trainable relative importance weights
Journal Section Articles

APA ÖZKAN, S., ÖZKAN, A. (2019). Selective word encoding for effective text representation. Turkish Journal of Electrical Engineering and Computer Science, 27(2), 1028-1040. Retrieved from http://dergipark.org.tr/tbtkelektrik/issue/45636/574577