Research Article
CREATING A COMPREHENSIVE DATA SET FOR DECEPTION DETECTION STUDIES IN TURKISH TEXTS

Year 2024, Volume: 11 Issue: 2, 138 - 145, 31.12.2024

Abstract

Purpose- Deception detection has gained increasing importance with the widespread use of digital communication and online platforms. While numerous studies have been conducted on deception detection in various languages, a significant gap remains in the availability of a Turkish-language dataset for detecting deceptive reviews. This study addresses this gap by creating a comprehensive dataset specifically for deception detection in Turkish hotel reviews, including real, fake, and AI-generated comments. The dataset aims to facilitate research on deception detection, enhance the reliability of user-generated content, and contribute to the development of automated methods for identifying deceptive texts.
Methodology- The study compiled a dataset of 5,013 Turkish hotel reviews comprising real reviews from Tripadvisor, fake reviews written by humans, and fake reviews generated by AI using the OpenAI GPT API. The collected dataset underwent extensive preprocessing to ensure quality and reliability, including data cleaning, the application of filtering criteria, and balancing the distribution of real and fake comments. Descriptive and statistical analyses were performed to identify linguistic patterns and structural differences across these three categories. Specifically, linguistic features such as comment length, complexity, readability (measured using the Gunning Fog Index), and pronoun usage were examined.
Findings- Real comments are longer and more detailed than fake and AI-generated comments, while fake comments are simpler and clearer, which is consistent with findings from deception detection studies in other languages. AI-generated comments frequently use the pronoun ‘we’, while fake comments tend to mimic personal experience with the pronoun ‘I’. In addition, pronoun usage in real comments is more balanced and reflects an authentic language structure.
Conclusion- This study makes an important contribution to fake comment detection by providing the first large-scale Turkish deception detection dataset. The findings can help businesses improve the credibility of online comments. Future work could focus on machine learning applications and comparisons with different languages.
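
The linguistic features named in the methodology (comment length, Gunning Fog readability, and pronoun usage) can be illustrated with a minimal Python sketch. The abstract does not describe the authors' actual implementation, so everything here is an assumption: the helper names (`tr_lower`, `count_syllables`, `gunning_fog`, `pronoun_counts`) are hypothetical, syllables are approximated by counting vowels, "complex" words are taken to be those with three or more syllables as in the standard English formulation of the index, and only explicit forms of `ben` ('I') and `biz` ('we') are counted, which ignores pronouns expressed solely through verb suffixes.

```python
import re

# Assumption: one vowel per syllable is used as a syllable approximation.
TURKISH_VOWELS = set("aeıioöuüâîû")

# Assumption: only these explicit, case-inflected pronoun forms are counted.
FIRST_SINGULAR = {"ben", "beni", "bana", "benim", "bende", "benden"}
FIRST_PLURAL = {"biz", "bizi", "bize", "bizim", "bizde", "bizden"}


def tr_lower(text: str) -> str:
    """Lowercase with Turkish dotted/dotless i handled explicitly."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def tokenize(text: str) -> list[str]:
    """Split text into lowercase runs of letters (digits and punctuation dropped)."""
    return re.findall(r"[^\W\d_]+", tr_lower(text))


def count_syllables(word: str) -> int:
    """Approximate the syllable count as the number of vowels."""
    return sum(1 for ch in tr_lower(word) if ch in TURKISH_VOWELS)


def gunning_fog(text: str) -> float:
    """0.4 * (words per sentence + 100 * share of words with >= 3 syllables)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    if not sentences or not words:
        return 0.0
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return 0.4 * (len(words) / len(sentences) + 100 * complex_words / len(words))


def pronoun_counts(text: str) -> dict[str, int]:
    """Count explicit first-person singular ('I') and plural ('we') pronouns."""
    words = tokenize(text)
    return {
        "I": sum(1 for w in words if w in FIRST_SINGULAR),
        "we": sum(1 for w in words if w in FIRST_PLURAL),
    }


if __name__ == "__main__":
    review = "Biz otelde kaldık. Odalar temizdi."  # made-up example sentence
    print(len(tokenize(review)), gunning_fog(review), pronoun_counts(review))
```

Because Turkish is agglutinative, many everyday words reach three syllables, so scores from this sketch run much higher than typical English values and are only meaningful for comparing groups of reviews with one another.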

Details

Primary Language English
Subjects Labor Economics, Microeconomics (Other), Business Administration
Journal Section Articles
Authors

Ekin Akkol 0000-0003-2924-8758

Yılmaz Gökşen 0000-0002-2291-2946

Publication Date December 31, 2024
Submission Date November 1, 2024
Acceptance Date December 20, 2024
Published in Issue Year 2024 Volume: 11 Issue: 2

Cite

APA Akkol, E., & Gökşen, Y. (2024). CREATING A COMPREHENSIVE DATA SET FOR DECEPTION DETECTION STUDIES IN TURKISH TEXTS. Research Journal of Business and Management, 11(2), 138-145. https://doi.org/10.17261/Pressacademia.2024.1960

Research Journal of Business and Management (RJBM) is a scientific, academic, double-blind peer-reviewed, semi-annual, open-access online journal. The journal publishes two issues a year, in June and December. The publication language of the Journal is English. RJBM aims to provide a research source for all practitioners, policy makers, professionals and researchers working in all related areas of business, management and organizations. The editor in chief of RJBM invites all manuscripts that cover theoretical and/or applied research on topics related to the interest areas of the Journal. RJBM publishes academic research studies only. RJBM charges no submission or publication fee.

Ethics Policy - RJBM applies the standards of the Committee on Publication Ethics (COPE). RJBM is committed to the academic community in ensuring the ethics and quality of published manuscripts. Plagiarism is strictly forbidden; manuscripts found to be plagiarized will not be accepted or, if already published, will be removed from the publication. Authors must certify that their manuscripts are their original work. Plagiarism, duplicate publication, data fabrication, and redundant publication are forbidden. Manuscripts are subject to a plagiarism check by iThenticate or a similar tool. All manuscript submissions must include a similarity report (up to 15% excluding quotes, bibliography, abstract).

Open Access - All research articles published in PressAcademia Journals are fully open access; immediately freely available to read, download and share. Articles are published under the terms of a Creative Commons license which permits use, distribution and reproduction in any medium, provided the original work is properly cited. Open access is a property of individual works, not necessarily journals or publishers. Community standards, rather than copyright law, will continue to provide the mechanism for enforcement of proper attribution and responsible use of the published work, as they do now.