CLUDS: COMBINING LABELED AND UNLABELED DATA WITH LOGISTIC REGRESSION FOR SOCIAL MEDIA ANALYSIS

Ayşe Berna Altınel

doi:10.21923/jesd.780002


Abstract

Automatic text classification and sentiment polarity detection are two important research problems in social media analysis. A document classification algorithm must capture the meanings of words in order to classify accurately. Another important issue in text classification is the scarcity of labeled data. This study presents a new semi-supervised methodology, Combining Labeled and Unlabeled Data with Semantic Values of Terms (CLUDS). CLUDS has four steps: preprocessing, instance labeling, combining labeled and unlabeled data, and prediction. The preprocessing step uses the Latent Dirichlet Allocation (LDA) algorithm, and the instance labeling step applies Logistic Regression. In CLUDS, relevance value computation is applied as a supervised term weighting methodology in the text classification field. According to the literature, CLUDS is the first attempt to use both relevance and weighting calculation in a semi-supervised semantic kernel for Support Vector Machines (SVM). Sprinkled-CLUDS and Adaptive-Sprinkled-CLUDS have also been implemented. Experimental results show that CLUDS, Sprinkled-CLUDS and Adaptive-Sprinkled-CLUDS yield a performance gain over the baseline algorithms on the test sets.
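The instance labeling and combining steps named in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the two-dimensional feature vectors are hand-made stand-ins for LDA topic proportions, the `threshold` parameter and all function names are hypothetical, and the final classifier is a retrained logistic regression rather than the SVM with a semantic kernel used in the study.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, epochs=500, lr=0.5):
    """Binary logistic regression fitted by batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(model, x):
    """Probability of the positive class for one feature vector."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def label_and_combine(X_lab, y_lab, X_unlab, threshold=0.8):
    """Pseudo-label confident unlabeled instances and merge them with the labeled set.

    `threshold` is an assumed confidence cut-off, not a value from the paper.
    """
    model = train_logreg(X_lab, y_lab)
    X_all, y_all = list(X_lab), list(y_lab)
    for x in X_unlab:
        p = predict_proba(model, x)
        if p >= threshold:
            X_all.append(x)
            y_all.append(1)
        elif p <= 1.0 - threshold:
            X_all.append(x)
            y_all.append(0)
        # Instances the model is unsure about stay unlabeled and are skipped.
    return X_all, y_all


# Toy data: each vector stands in for the topic proportions of one document.
X_lab = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y_lab = [1, 1, 0, 0]
X_unlab = [[0.85, 0.15], [0.15, 0.85], [0.5, 0.5]]

X_all, y_all = label_and_combine(X_lab, y_lab, X_unlab)

# Prediction step: a stand-in classifier trained on the combined set.
final_model = train_logreg(X_all, y_all)
print(len(X_all), y_all)
print(predict_proba(final_model, [0.7, 0.3]) > 0.5)
```

In this toy run the two clearly topical unlabeled vectors are pseudo-labeled and added to the training set, while the ambiguous `[0.5, 0.5]` vector is left out. A confidence cut-off is one common way to limit the label noise that self-labeling can introduce.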


Supporting Institution

TÜBİTAK

Project Number

118E315


Details

Primary Language

English

Subjects

Computer Software

Section

Research Article

Authors

Ayşe Berna Altınel ^*
0000-0001-5544-0925
Türkiye

Publication Date

20 December 2021

Submission Date

13 August 2020

Acceptance Date

6 September 2021

Published in Issue

Year 2021, Volume 9, Issue 4

DOI

https://doi.org/10.21923/jesd.780002

IZ

https://izlik.org/JA32FC37DE

Cite As

APA

Altınel, A. B. (2021). CLUDS: COMBINING LABELED AND UNLABELED DATA WITH LOGISTIC REGRESSION FOR SOCIAL MEDIA ANALYSIS. Mühendislik Bilimleri ve Tasarım Dergisi, 9(4), 1048-1061. https://doi.org/10.21923/jesd.780002


Cited By

TÜRKÇE KONUŞMADA DUYGU TANIMA İÇİN MAKİNE ÖĞRENME YÖNTEMLERİ VE DERİN ÖĞRENME TABANLI MODELLERİN KARŞILAŞTIRILMASI

Mühendislik Bilimleri ve Tasarım Dergisi

https://doi.org/10.21923/jesd.1350375