Research Article

A Study on Twitter User Gender Classification using Feature Selection

Year 2022, Volume: 14 Issue: 3, 204 - 210, 31.12.2022
https://doi.org/10.29137/umagd.1214018

Abstract

In today's business models, institutions and organizations want to know users' opinions in order to improve their decision-making processes. Millions of people around the world share their daily comments and thoughts as text messages, videos, or photos through social network applications. The rapid growth of applications such as Facebook, Instagram, Twitter, and YouTube gives researchers an attractive field in which to investigate the content of the big data shared there and to analyze user behavior. This enormous volume of social network data is used for effective marketing, personalized recommendation systems, finding opinion leaders, the pharmaceutical industry, and political policy making, and it is analyzed with machine learning methods. In this study, a feature selection method is used to improve the performance of automatic gender classification of Twitter users. The method is applied to three datasets (user descriptions, tweets, and the combination of both), and its performance is evaluated with naive Bayes and logistic regression classifiers. The experimental results show that the classification success of the features selected with the chi-square feature selection method is much better with the logistic regression classifier.

Thanks

This study was presented at the ICSAR 2022 (1st International Conference on Scientific and Academic Research) conference.


Original title (Turkish): Nitelik Seçimi Kullanarak Twitter Kullanıcısının Cinsiyet Sınıflandırması üzerine Bir Çalışma


Details

Primary Language: Turkish
Subjects: Engineering
Journal Section: Articles
Authors: Tuba Parlar (ORCID 0000-0002-8004-6150)
Publication Date: December 31, 2022
Submission Date: December 3, 2022
Published in Issue: Year 2022, Volume 14, Issue 3

Cite

APA Parlar, T. (2022). Nitelik Seçimi Kullanarak Twitter Kullanıcısının Cinsiyet Sınıflandırması üzerine Bir Çalışma. International Journal of Engineering Research and Development, 14(3), 204-210. https://doi.org/10.29137/umagd.1214018

All Rights Reserved. Kırıkkale University, Faculty of Engineering and Natural Science.