Research Article

Gender and Interest Area Analysis on Twitter using Machine Learning Algorithms

Year 2020, pp. 187-194, 30.11.2020
https://doi.org/10.31590/ejosat.819722

Abstract

Social networks have become popular platforms that help people connect with each other. In addition to individuals, companies and institutions are interested in social networks for reasons such as promoting and marketing their products or getting feedback on a specific topic. Their goal is to ensure that people receive information only about the products and areas they are interested in, rather than being targeted with unnecessary information. To achieve their business goals, companies and institutions would like to determine the gender of the person who shares a post and the interest area the post relates to. Using this information, they carry out various studies to reach their target audiences. In this study, we analyze tweets to identify the gender of Twitter users and the interest areas that tweets relate to. We develop an application that uses the Twitter Application Programming Interface (API) and use it to collect data for two training sets: one for gender determination and one for interest area determination. For the gender determination training set, we collect tweets without filtering them. For the interest area determination training set, we collect tweets by filtering them with sets of keywords created separately for each interest area. After collecting the tweets, we tag them manually, using the application to facilitate the tagging process. After determining the attributes through various experiments, we create the two training sets for use in supervised machine learning. Using these training sets, we build models with the Naive Bayes, K-Nearest Neighbors (KNN), C4.5, Support Vector Machine (SVM), and Sequential Minimal Optimization (SMO) algorithms.
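The keyword-based filtering used for the interest area training set can be sketched roughly as follows. This is a minimal illustration, not the authors' application: the interest areas, the keyword sets, and the function names `tokenize`, `candidate_areas`, and `filter_tweets` are all hypothetical, and the tweets are assumed to have already been retrieved as plain strings.

```python
import re

# Hypothetical keyword sets; the study's actual interest areas and
# keywords are not listed in the abstract.
KEYWORDS = {
    "sports": {"football", "match", "league"},
    "technology": {"phone", "software", "laptop"},
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def candidate_areas(tweet):
    """Return the interest areas whose keyword set overlaps the tweet."""
    tokens = tokenize(tweet)
    return sorted(area for area, words in KEYWORDS.items() if tokens & words)


def filter_tweets(tweets):
    """Keep only tweets that match at least one keyword set."""
    kept = []
    for tweet in tweets:
        areas = candidate_areas(tweet)
        if areas:
            kept.append((tweet, areas))
    return kept
```

A filter like this only narrows the pool of tweets; the retained tweets would still be tagged manually, as the abstract describes.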
We evaluate the performance of the models using the kappa statistic and accuracy. Among the models created for gender prediction, the SVM algorithm had the lowest performance, with 44.6% accuracy and a kappa value of 0.17, while the SMO algorithm had the highest, with 99.9% accuracy and a kappa value of 0.99. Likewise, among the models created for interest area prediction, the SVM algorithm gave the lowest performance, with 47.9% accuracy and a kappa value of 0.37, while the KNN algorithm achieved the highest, with 93.18% accuracy and a kappa value of 0.91. The accuracy values and kappa values are consistent with each other.
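For readers unfamiliar with the two evaluation criteria, the sketch below computes accuracy and Cohen's kappa from scratch. It assumes the kappa statistic reported in the study is the standard Cohen's kappa between true and predicted labels; the function names and the toy labels are hypothetical and are not data from the study.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected by chance, from the marginal label frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when only one class occurs; this sketch
        # returns 1.0 because every label then agrees.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels, for illustration only.
y_true = ["female", "female", "male", "male", "female", "male", "male", "female"]
y_pred = ["female", "male", "male", "male", "female", "male", "female", "female"]

print(accuracy(y_true, y_pred))     # 0.75
print(cohen_kappa(y_true, y_pred))  # 0.5
```

In this toy case both classes are equally frequent, so chance agreement is 0.5 and an accuracy of 0.75 corresponds to a kappa of 0.5. This shows why kappa is reported alongside accuracy: it discounts the agreement a classifier would reach by chance.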

References

  • Parantapa Bhattacharya, Muhammad Bilal Zafar, Niloy Ganguly, Saptarshi Ghosh, Krishna P. Gummadi, “Inferring User Interests in the Twitter Social Network”, In: Proceedings of the 8th ACM conference on recommender systems. ACM, pp 357–360, 2019.
  • Mounica Arroju, Aftab Hassan, Golnoosh Farnadi, “Age, Gender and Personality Recognition using Tweets in a Multilingual Setting”, in CLEF 2015 working notes, Toulouse, France, 2015.
  • Zach Wood-Doughty, Nicholas Andrews, Rebecca Marvin, Mark Dredze, “Predicting Twitter User Demographics from Names Alone”, Association for Computational Linguistics, USA, pp. 105-111, 2018.
  • J.V.P.S Avinash and Rakshith Muniraju and Shreyas Shaligraman, “Gender Classification using Twitter Feeds”, CS-552-Advanced Data Mining – Final Project, Illinois Institute of Technology, Chicago, 2017.
  • Mohsen Sayyadiharikandeh, Giovanni Luca Ciampaglia, Alessandro Flammini, “Cross-domain gender detection in Twitter”, Indiana University, School of Informatics and Computing, USA, pp. 39-54, 2019.
  • A. McCallum, and K. Nigam, “A comparison of event models for Naïve Bayes text classification,” AAAI-98 Workshop on Learning for Text Categorization, pp. 41–48, 1998.
  • C. Cortes, and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273-297, 1995.
  • S. Zhang, M. Zong, X. Zhu, X. Li, R. Wang, “Efficient KNN Classification With Different Numbers of Nearest Neighbors”, IEEE Transactions on Neural Networks and Learning Systems, vol. 29, Issue 5, May 2018.
  • Bhattacharya P, Bilal Zafar M, Ganguly N, Ghosh S, Gummadi K.P, “Inferring User Interest in Twitter Social Network”, IIT Kharagpur MPI-SWS, Germany, pp. 6-10, 2014.
  • Efstratios Kontopoulos, Christos Berberidis, Theologos Dergiades, Nick Bassiliades, “Ontology-based sentiment analysis of twitter posts”, vol. 40, Issue 10, pp. 4065-4074, August 2013.
  • Sitaram Asur, Bernardo A. Huberman, “Predicting the Future With Social Media”, 11636305, IEEE, Toronto, ON, Canada, 01 November 2010.
  • Aron Culotta, “Towards detecting influenza epidemics by analyzing Twitter messages”, Department of Computer Science, Southeastern Louisiana University, Hammond, LA 70402, 2010.
  • Juan M. Soler, Fernando Cuartero, Manuel Roblizo, "Twitter as a Tool for Predicting Elections Results", IEEE, Istanbul, Turkey, 04 February 2013.
  • Shereen Hussein, Mona Farouk, ElSayed Hemayed, “Gender identification of Egyptian dialect in Twitter”, vol. 20, Issue 2, pp. 109-116, July 2019.
  • Edy Budiman, Haviluddin, Nataniel Dengan, Awang Harsa Kridalaksana, Masna Wati, Purnawansyah, “Performance of Decision Tree C4.5 Algorithm in Student Academic Evaluation”, Faculty of Computer Science and Information Technology, Mulawarman University, Samarinda, Indonesia, Computational Science and Technology, pp. 380-389, 2018.
  • S.S. Keerthi, E.G. Gilbert, “Convergence of Generalized SMO Algorithm for SVM Classifier Design”, Dept. of Mechanical and Production Engineering, University of Singapore, pp. 351–360, 2002.

Twitter Platformunda Makine Öğrenmesi Algoritmalarıyla Cinsiyet ve İlgi Analizi


Abstract

Social networks such as Twitter have become a popular platform for people to communicate. In addition to individual users, institutions and companies are also interested in this field for many reasons, such as product promotion, marketing, or getting feedback on a topic. The goal of institutions and companies is to ensure that people are not disturbed by unnecessary information outside the products and areas they are interested in. To this end, institutions and companies need information such as whether the person sharing a post is female or male and the area the tweet relates to, and, based on this information, they carry out various studies to reach their target audiences. In this study, the gender of the person sharing a post and the interest area of the shared tweet are predicted from content produced on Twitter. For this purpose, an application that uses the Twitter Application Programming Interface (API) was developed. Using this application, data were collected to create two different training sets. For the gender determination training set, tweets were collected without filtering. For the interest area determination training set, tweets were collected by filtering them with sets of keywords defined for the different interest areas. These tweets were then tagged manually, using the application to make the tagging work easier. After the attributes were determined through various experiments, two different training sets were created for use in supervised machine learning. Using these training sets, models were created for the Naive Bayes, K-Nearest Neighbors (KNN), C4.5, Support Vector Machine (SVM), and Sequential Minimal Optimization (SMO) algorithms. The performance of the models was evaluated using the kappa statistic and accuracy.
When the performance of the resulting models was evaluated, among the models created for gender prediction, the SVM algorithm had the lowest performance, with 44.6% accuracy and a kappa value of 0.17, while the SMO algorithm provided the highest, with 99.9% accuracy and a kappa value of 0.99. Likewise, among the models created for interest area prediction, the SVM algorithm gave the lowest performance, with 47.9% accuracy and a kappa value of 0.37, while the KNN algorithm provided the highest, with 93.18% accuracy and a kappa value of 0.91. The accuracy values and kappa values were found to be consistent with each other.

There are 16 citations in total.

Details

Primary Language Turkish
Subjects Engineering
Journal Section Articles
Authors

Enes Günçe 0000-0001-8546-2324

Aydin Carus 0000-0003-3370-5974

Publication Date November 30, 2020
Published in Issue Year 2020

Cite

APA Günçe, E., & Carus, A. (2020). Twitter Platformunda Makine Öğrenmesi Algoritmalarıyla Cinsiyet ve İlgi Analizi. Avrupa Bilim Ve Teknoloji Dergisi, 187-194. https://doi.org/10.31590/ejosat.819722