Research Article

Predicting users' age and gender from Turkish social media messages

Year 2023, Volume 12, Issue 2, 325 - 333, 15.04.2023
https://doi.org/10.28948/ngumuh.1191719

Abstract

Social media platforms offer a very large amount of data on people's opinions about any topic. Such platforms are therefore very important data sources for many studies, such as market analysis and public opinion prediction. However, because social media users do not fully represent a society, additional steps are needed to reduce the bias in social media data, such as counting that takes users' age, gender, and similar attributes into account. In this study, we address the problem of predicting the age range and gender of an account owner from the messages shared by a given Turkish Twitter account. As part of the study, a labeled dataset consisting of the age and gender information of 1040 Twitter users was prepared. Five different methods were then developed, based on words, characters, retweets, fastText, and BERT. Our extensive experiments show that the messages users share offer important clues about their age and gender.

Supporting Institution

TÜBİTAK

Project Number

120E514


Predicting users' age and gender using Turkish social media messages

Abstract

Social media platforms provide a huge amount of data on people's opinions on any topic. Therefore, such platforms are very important data sources for many studies, such as market analysis and social opinion prediction. However, since social media users do not fully reflect a society, additional steps, such as weighted counting based on users' age and gender, are necessary to reduce bias. In this study, we focus on the problem of predicting the age range and gender of the owner of a given Twitter account using the Turkish messages shared by that account. Within the scope of the study, we constructed a labeled dataset consisting of the age and gender information of 1040 Twitter users. In addition, we developed five different methods based on words, characters, retweets, fastText, and BERT. Our extensive experiments show that the messages shared by users offer important clues about their age and gender.
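To make the idea of weighted counting concrete, the following is a minimal sketch of one common approach, post-stratification, in which each user's opinion is weighted so that every demographic group contributes in proportion to its share of the target population. It is an illustration only and not necessarily the procedure used in the study: the function name `weighted_share`, the age ranges, the sample, and the population shares are all hypothetical.

```python
from collections import Counter


def weighted_share(users, population_share):
    """Return the post-stratified share of each opinion.

    users: list of (group, opinion) pairs, where group is a predicted
        demographic cell such as ("female", "50+") (hypothetical cells).
    population_share: dict mapping each group to its share of the
        target population (hypothetical figures, summing to 1).
    """
    # Number of sampled users that fall into each demographic cell.
    group_sizes = Counter(group for group, _ in users)
    totals = Counter()
    for group, opinion in users:
        # A user's weight is the cell's population share divided by the
        # cell's sample size, so each cell adds up to its population share.
        totals[opinion] += population_share[group] / group_sizes[group]
    norm = sum(totals.values())
    return {opinion: weight / norm for opinion, weight in totals.items()}


# Hypothetical sample in which young men are over-represented.
sample = (
    [(("male", "18-29"), "A")] * 6
    + [(("male", "18-29"), "B")] * 2
    + [(("female", "50+"), "B")] * 2
)
# Hypothetical population in which both cells are equally large.
population = {("male", "18-29"): 0.5, ("female", "50+"): 0.5}

shares = weighted_share(sample, population)
print(shares)
```

In this toy sample a plain count gives opinion A 60% of the users, whereas the weighted count gives it 37.5%, because the over-represented group is scaled down. Applying such a correction requires knowing each user's group, which is where age and gender prediction comes in.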


Details

Primary Language: Turkish
Subjects: Computer Software
Section: Computer Engineering
Authors

Mustafa Kaan Görgün 0000-0002-2608-2070

Gökçe Başak Demirok 0000-0003-1620-2013

Mucahid Kutlu 0000-0002-5660-4992

Project Number: 120E514
Publication Date: 15 April 2023
Submission Date: 19 October 2022
Acceptance Date: 27 February 2023
Published in Issue: Year 2023, Volume 12, Issue 2

Cite

APA Görgün, M. K., Demirok, G. B., & Kutlu, M. (2023). Türkçe sosyal medya mesajlarından kullanıcıların yaş ve cinsiyetini tahmin etme. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi, 12(2), 325-333. https://doi.org/10.28948/ngumuh.1191719
AMA Görgün MK, Demirok GB, Kutlu M. Türkçe sosyal medya mesajlarından kullanıcıların yaş ve cinsiyetini tahmin etme. NÖHÜ Müh. Bilim. Derg. April 2023;12(2):325-333. doi:10.28948/ngumuh.1191719
Chicago Görgün, Mustafa Kaan, Gökçe Başak Demirok, and Mucahid Kutlu. “Türkçe sosyal medya mesajlarından kullanıcıların yaş ve cinsiyetini tahmin etme”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 12, no. 2 (April 2023): 325-33. https://doi.org/10.28948/ngumuh.1191719.
EndNote Görgün MK, Demirok GB, Kutlu M (01 April 2023) Türkçe sosyal medya mesajlarından kullanıcıların yaş ve cinsiyetini tahmin etme. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 12 2 325–333.
IEEE M. K. Görgün, G. B. Demirok, and M. Kutlu, “Türkçe sosyal medya mesajlarından kullanıcıların yaş ve cinsiyetini tahmin etme”, NÖHÜ Müh. Bilim. Derg., vol. 12, no. 2, pp. 325–333, 2023, doi: 10.28948/ngumuh.1191719.
ISNAD Görgün, Mustafa Kaan et al. “Türkçe sosyal medya mesajlarından kullanıcıların yaş ve cinsiyetini tahmin etme”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 12/2 (April 2023), 325-333. https://doi.org/10.28948/ngumuh.1191719.
JAMA Görgün MK, Demirok GB, Kutlu M. Türkçe sosyal medya mesajlarından kullanıcıların yaş ve cinsiyetini tahmin etme. NÖHÜ Müh. Bilim. Derg. 2023;12:325–333.
MLA Görgün, Mustafa Kaan et al. “Türkçe sosyal medya mesajlarından kullanıcıların yaş ve cinsiyetini tahmin etme”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi, vol. 12, no. 2, 2023, pp. 325-33, doi:10.28948/ngumuh.1191719.
Vancouver Görgün MK, Demirok GB, Kutlu M. Türkçe sosyal medya mesajlarından kullanıcıların yaş ve cinsiyetini tahmin etme. NÖHÜ Müh. Bilim. Derg. 2023;12(2):325-33.
