Gender Identification from Turkish Tweets Using Pre-Trained Language Models (Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti)

İlhami Sel; Davut Hanbay

doi:10.35234/fumbd.929133

Research Article

Year 2021, Volume: 33, Issue: 2, pp. 675-684, 15.09.2021

Cited By: 4

Abstract

Author profiling is the analysis of a body of text to infer characteristics of its author from the style and content of the writing. These characteristics include age, gender, personality traits and even profession. Gender identification is one subfield of author profiling. It matters in forensic contexts, chiefly cybercrime and the spreading of fake news, as well as in marketing (advertising) and in the study of sociological and psychological phenomena. Twitter posts are widely used for gender identification even though they may contain ungrammatical language, abbreviated words and meaningless sentence structures. This study attempts to identify gender from Turkish Twitter posts, treating the problem as a classification task. A machine learning method (TF-IDF + SVM), deep learning methods (LSTM, CNN) and pre-trained language models for Turkish (BERT, DistilBERT, ELECTRA) were used. In the experiments, the BERT model with a 128k vocabulary achieved the highest success rate (80.1%). The study also serves as a detailed reference for other text classification tasks.

Keywords

Author profiling, gender identification, natural language processing, language models, text classification
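
The TF-IDF + SVM baseline named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, which this page does not describe. The whitespace tokenizer, the smoothed IDF formula, the Pegasos-style hinge-loss trainer, its hyperparameters, and the toy tweets are all assumptions chosen for illustration. The labels +1 and -1 are arbitrary stand-ins for the two classes.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace. Real tweets need more care
    # (Turkish-aware casing, hashtags, mentions, URLs).
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(doc, idf):
    # Sparse, L2-normalised TF-IDF vector. Terms unseen in training are dropped.
    counts = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def train_linear_svm(vectors, labels, epochs=50, lam=0.01, seed=0):
    # Pegasos-style stochastic subgradient descent on the L2-regularised
    # hinge loss. Labels must be +1 or -1. Returns a sparse weight dict.
    rng = random.Random(seed)
    w = {}
    step = 0
    order = list(range(len(vectors)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            step += 1
            eta = 1.0 / (lam * step)  # decaying learning rate
            x, y = vectors[i], labels[i]
            margin = y * sum(w.get(t, 0.0) * v for t, v in x.items())
            scale = 1.0 - eta * lam  # shrinkage from the regulariser
            for t in w:
                w[t] *= scale
            if margin < 1.0:  # example violates the margin: move towards it
                for t, v in x.items():
                    w[t] = w.get(t, 0.0) + eta * y * v
    return w


def predict(w, x):
    # Sign of the linear score; ties go to the -1 class.
    score = sum(w.get(t, 0.0) * v for t, v in x.items())
    return 1 if score > 0 else -1


# Invented toy data, not taken from the paper's dataset.
train_docs = ["bugün maç izledim", "maç çok güzeldi",
              "yeni kitap aldım", "kitap harika çıktı"]
train_labels = [1, 1, -1, -1]

idf = fit_idf(train_docs)
weights = train_linear_svm([tfidf(d, idf) for d in train_docs], train_labels)
print(predict(weights, tfidf("akşam maç var", idf)))
print(predict(weights, tfidf("kitap okudum", idf)))
```

In practice a library implementation of both steps would normally replace this hand-written version. The sketch only shows how sparse term weights feed a linear max-margin classifier.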

Details

Primary Language	Turkish
Subjects	Engineering
Section	MBD
Authors	İlhami Sel (ORCID 0000-0003-0222-7017); Davut Hanbay (ORCID 0000-0003-2271-7865)
Publication Date	15 September 2021
Submission Date	28 April 2021
Published in Issue	Year 2021, Volume: 33, Issue: 2

Cite

APA	Sel, İ., & Hanbay, D. (2021). Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi, 33(2), 675-684. https://doi.org/10.35234/fumbd.929133
AMA	Sel İ, Hanbay D. Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi. September 2021;33(2):675-684. doi:10.35234/fumbd.929133
Chicago	Sel, İlhami, and Davut Hanbay. “Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti”. Fırat Üniversitesi Mühendislik Bilimleri Dergisi 33, no. 2 (September 2021): 675-84. https://doi.org/10.35234/fumbd.929133.
EndNote	Sel İ, Hanbay D (01 September 2021) Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi 33 2 675–684.
IEEE	İ. Sel and D. Hanbay, “Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti”, Fırat Üniversitesi Mühendislik Bilimleri Dergisi, vol. 33, no. 2, pp. 675–684, 2021, doi: 10.35234/fumbd.929133.
ISNAD	Sel, İlhami - Hanbay, Davut. “Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti”. Fırat Üniversitesi Mühendislik Bilimleri Dergisi 33/2 (September 2021), 675-684. https://doi.org/10.35234/fumbd.929133.
JAMA	Sel İ, Hanbay D. Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi. 2021;33:675–684.
MLA	Sel, İlhami, and Davut Hanbay. “Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti”. Fırat Üniversitesi Mühendislik Bilimleri Dergisi, vol. 33, no. 2, 2021, pp. 675-84, doi:10.35234/fumbd.929133.
Vancouver	Sel İ, Hanbay D. Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi. 2021;33(2):675-84.

Cited By

A new perspective on social sustainability: examining Amazon workers’ working conditions and protests applying computational methods in social sciences

Discover Sustainability

https://doi.org/10.1007/s43621-024-00737-x

ÖN EĞİTİMLİ DİL MODELLERİYLE DUYGU ANALİZİ [Sentiment Analysis with Pre-Trained Language Models]

İstanbul Sabahattin Zaim Üniversitesi Fen Bilimleri Enstitüsü Dergisi

https://doi.org/10.47769/izufbed.1312032

Türkçe Sosyal Medya Mesajlarından Kullanıcıların Yaş ve Cinsiyetini Tahmin Etme [Predicting the Age and Gender of Users from Turkish Social Media Messages]

Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi

https://doi.org/10.28948/ngumuh.1191719

Naive Bayes Sınıflandırıcısı Kullanılarak YouTube Verileri Üzerinden Çok Dilli Duygu Analizi [Multilingual Sentiment Analysis on YouTube Data Using a Naive Bayes Classifier]

Bilişim Teknolojileri Dergisi

https://doi.org/10.17671/gazibtd.999960