Author Identification with Machine Learning Algorithms

İbrahim Yülüce; Feriştah Dalkılıç

Research Article

Year 2022, Volume: 6 Issue: 1, 45 - 50, 20.07.2022

Abstract

Author identification is one of the application areas of text mining. It deals with automatically predicting the most likely author of an electronic text from a predefined set of candidate authors by using author-specific writing styles. In this study, we conducted an experiment to identify the author of a Turkish-language text by using classical machine learning methods, including Support Vector Machines (SVM), Gaussian Naive Bayes (GaussianNB), Multilayer Perceptron (MLP), Logistic Regression (LR), and Stochastic Gradient Descent (SGD), and ensemble learning methods, including Extremely Randomized Trees (ExtraTrees) and eXtreme Gradient Boosting (XGBoost). The proposed method was applied to author groups of three different sizes (10, 15, and 20 authors) drawn from a new dataset of newspaper articles. Term frequency-inverse document frequency (TF-IDF) vectors were created from word unigram (1-gram) and word bigram (2-gram) tokens. Our results show that SGD is the most successful method with word unigrams, reaching a classification accuracy of 0.976 (97.6%), and that LR is the most successful method with word bigrams, reaching a classification accuracy of 0.935 (93.5%).

Keywords

author identification, natural language processing, tf-idf, text mining, machine learning
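
The feature extraction step described in the abstract (TF-IDF vectors over word unigrams or word bigrams) can be illustrated with a minimal sketch. The abstract does not state which library or which TF-IDF variant was used, so everything below is an assumption: the smoothed IDF formula, the L2 normalisation, the toy English sentences, and the helper names (word_ngrams, tfidf_vectors, vectorize, predict_author) are hypothetical. The final nearest-document step by cosine similarity is only a stand-in to make the example complete; it is not one of the classifiers evaluated in the study.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n):
    """Build one sparse, L2-normalised TF-IDF vector (a dict) per document.

    Assumed variant: tf = raw count, idf = ln((1 + N) / (1 + df)) + 1.
    """
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + f)) + 1.0 for t, f in df.items()}
    vectors = []
    for c in counts:
        vec = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors, idf


def vectorize(text, n, idf):
    """Map an unseen text onto the training vocabulary (unknown n-grams are dropped)."""
    c = Counter(g for g in word_ngrams(text, n) if g in idf)
    vec = {t: tf * idf[t] for t, tf in c.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def predict_author(text, n, train_vecs, idf, authors):
    """Stand-in classifier: return the author of the most similar training text."""
    query = vectorize(text, n, idf)
    scores = [cosine(query, v) for v in train_vecs]
    return authors[scores.index(max(scores))]


# Toy data (hypothetical): two short texts for each of two authors.
docs = [
    "the economy grew while inflation fell sharply",
    "inflation and the economy dominate the budget debate",
    "the team won the match with a late goal",
    "a late goal decided the derby match",
]
authors = ["author_a", "author_a", "author_b", "author_b"]
query = "the budget debate is about inflation"

unigram_vecs, unigram_idf = tfidf_vectors(docs, 1)
bigram_vecs, bigram_idf = tfidf_vectors(docs, 2)

print(predict_author(query, 1, unigram_vecs, unigram_idf, authors))
print(predict_author(query, 2, bigram_vecs, bigram_idf, authors))
```

Switching n between 1 and 2 is the only change needed to move from word unigram to word bigram features, which mirrors the two configurations compared in the abstract. In practice the sparse vectors would be passed to a trained classifier such as SGD or LR rather than to a similarity lookup.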

Project Number

378

Details

Primary Language	English
Subjects	Engineering
Journal Section	Articles
Authors	İbrahim Yülüce (ORCID 0000-0002-3652-7184), Feriştah Dalkılıç (ORCID 0000-0001-7528-5109)
Project Number	378
Publication Date	July 20, 2022
Submission Date	June 13, 2022
Published in Issue	Year 2022 Volume: 6 Issue: 1

Cite

IEEE	İ. Yülüce and F. Dalkılıç, “Author Identification with Machine Learning Algorithms”, IJMSIT, vol. 6, no. 1, pp. 45–50, 2022.
