Research Article

Detection of Malicious Codes Generated by Large Language Models: A Comparison of GPT-3.5, GPT-4o, Gemini, and Claude

Volume: 14 Number: 1 March 25, 2025

Abstract

This study presents machine learning-based approaches for detecting whether source code generated by Large Language Models (LLMs) is malicious. To this end, datasets of malicious and benign code samples were created using the GPT-3.5 (ChatGPT), GPT-4o, Gemini, and Claude language models. Features were extracted from the code samples with CodeBERT, CodeT5, and manual feature extraction techniques, and the samples were then classified using various machine learning algorithms. Experimental results show that this approach can effectively detect malicious code generated by LLMs. In binary malicious code classification, the Random Forest algorithm achieved the best F1 score, 94.92%, on the ChatGPT-generated dataset with CodeT5 feature extraction. In contrast, the classification models performed poorly on the dataset generated by the Claude language model. The study contributes to software security and is a step toward preventing the misuse of LLMs for malicious purposes.
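As a rough illustration of two ideas mentioned in the abstract, the sketch below shows what manual feature extraction and the F1 score could look like in plain Python. It is not the authors' implementation: the pattern list `SUSPICIOUS_PATTERNS` and the function `extract_features` are hypothetical, since the abstract does not list the features used in the study. Only the F1 definition (the harmonic mean of precision and recall) is standard.

```python
import re

# Hypothetical indicator patterns, chosen for illustration only.
# The features actually used in the study are not given in the abstract.
SUSPICIOUS_PATTERNS = {
    "exec_eval": r"\b(?:exec|eval)\s*\(",
    "shell_call": r"\b(?:os\.system|subprocess\.\w+)\s*\(",
    "socket_use": r"\bsocket\.\w+",
    "base64_decode": r"\bb64decode\s*\(",
}


def extract_features(source: str) -> dict:
    """Return simple hand-crafted counts for one code sample."""
    features = {
        name: len(re.findall(pattern, source))
        for name, pattern in SUSPICIOUS_PATTERNS.items()
    }
    features["num_lines"] = source.count("\n") + 1
    return features


def f1_score(y_true, y_pred, positive=1) -> float:
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sample = "import os\nos.system('ls')\neval(x)"
    print(extract_features(sample))
    # Labels: 1 = malicious, 0 = benign
    print(f1_score([1, 1, 0, 0], [1, 0, 0, 1]))
```

In a pipeline like the one the abstract describes, feature vectors of this kind (or embeddings from CodeBERT or CodeT5) would be passed to a classifier such as Random Forest, and the F1 score would be computed on held-out samples.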

Project Number

The authors are supported by the Scientific and Technological Research Council of Turkey, under grant TUBITAK 2209-A 1919B012324109.

Details

Primary Language

English

Subjects

Software and Application Security

Journal Section

Research Article

Publication Date

March 25, 2025

Submission Date

February 6, 2025

Acceptance Date

March 15, 2025

Published in Issue

Year 2025 Volume: 14 Number: 1

APA
Kurt Pehlivanoğlu, M., & Çoban, M. G. (2025). Detection of Malicious Codes Generated by Large Language Models: A Comparison of GPT-3.5, GPT-4o, Gemini, and Claude. International Journal of Information Security Science, 14(1), 1-12. https://doi.org/10.55859/ijiss.1634763