Research Article

Comparison of the Performance of Large Language Models (LLMs) in Predicting International Classification of Diseases Codes (ICD-10) Using Turkish Neurology Doctor Reports

Volume: 1 Number: 3 September 26, 2025

Abstract

Objective: Accurate and efficient assignment of International Classification of Diseases, 10th Revision (ICD-10) codes in neurology is vital for healthcare reimbursement, research, and patient health surveillance. However, manually extracting these codes from physician reports is time-consuming and error-prone. This study evaluates how well several large language models (LLMs) automatically predict ICD-10 diagnosis codes from Turkish neurology physician reports.

Method: Ten LLMs (ChatGPT, Cohere Coral, Claude, DeepSeek, Qwen, Groq, Gemini, Meta Llama, Mistral, and Perplexity) were evaluated on a dataset of 51 de-identified neurology physician reports. A standardized prompt instructed each LLM to extract the ICD-10 codes relevant to the diagnoses documented in the reports. The LLM-generated codes were then compared with a gold-standard set of codes assigned by certified neurology coding specialists. Performance metrics, including accuracy, precision, recall, and F1-score, were used to assess the models' effectiveness.

Results: Among the LLMs, ChatGPT was the top performer, with an accuracy of 68.6% and an F1-score of 0.812, combining a precision of 0.686 with perfect recall (1.0). It performed well in identifying common neurological conditions such as migraine, transient ischemic attack (TIA; G45.9), and motor neuron disorders. Gemini followed with 58.8% accuracy (F1-score: 0.750), while Qwen and Claude showed moderate performance (54.9% and 49.0% accuracy, respectively). Conversely, Groq and Meta Llama exhibited significant limitations, with accuracies of 25.5% and 27.5%, respectively.

Conclusion: While LLMs show promise for automating ICD-10 coding from neurology reports, their performance varies considerably. High-performing models such as ChatGPT demonstrate strong potential, but further refinement is needed to improve the accuracy and reliability of lower-performing systems. Future research should focus on enhancing training datasets, incorporating rule-based algorithms, and integrating human oversight to address discrepancies, particularly in complex or rare neurological cases.
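The comparison of LLM-generated codes against gold-standard codes can be illustrated with a minimal sketch. The abstract does not specify how the metrics were defined, so the scheme below is an assumption: each report has one gold code and at most one predicted code, an exact match counts as a true positive, a wrong code as a false positive, and a missing prediction as a false negative. The function names and the four toy reports are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from typing import Optional


def normalize(code: str) -> str:
    """Uppercase and trim so that ' g45.9' and 'G45.9' compare equal."""
    return code.strip().upper()


@dataclass
class Scores:
    accuracy: float
    precision: float
    recall: float
    f1: float


def score(gold: list[str], predicted: list[Optional[str]]) -> Scores:
    """Score one model under the assumed per-report exact-match scheme."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p is None:                       # model returned no code
            fn += 1
        elif normalize(p) == normalize(g):  # exact match with gold code
            tp += 1
        else:                               # model returned a wrong code
            fp += 1
    accuracy = tp / len(gold) if gold else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(accuracy, precision, recall, f1)


# Toy data for illustration only (not from the study's 51 reports).
gold = ["G45.9", "G43.9", "G12.2", "G40.9"]
predicted = ["G45.9", "g43.9", "G35", None]
print(score(gold, predicted))
```

Under this assumed scheme, a model that returns some code for every report has no false negatives, so its recall is 1.0 and its precision equals its accuracy. The study's actual metric definitions may differ.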

Supporting Institution

Başkent University, Faculty of Medicine

Project Number

KA25/180

Ethical Statement

This study was approved by the Başkent University Institutional Review Board (Project no: KA25/180) and supported by the Başkent University Research Fund.

Details

Primary Language

English

Subjects

Health Services and Systems (Other)

Journal Section

Research Article

Authors

Meryem Koruk
Türkiye

Mehmet Çağlar Akpinar
Türkiye

Mehmet İbrahim Öksüz
Türkiye

Yasmin Ayşe Öztoklu
Türkiye

Sevin Suyla Turan
Türkiye

Publication Date

September 26, 2025

Submission Date

July 20, 2025

Acceptance Date

September 5, 2025

Published in Issue

Year 2025 Volume: 1 Number: 3

APA
Koçak, M., Kibaroglu, S., Koruk, M., Akpinar, M. Ç., Ademoğulları, H., Öksüz, M. İ., Öztoklu, Y. A., & Turan, S. S. (2025). Comparison of the Performance of Large Language Models (LLMs) in Predicting International Classification of Diseases Codes (ICD-10) Using Turkish Neurology Doctor Reports. Northern Journal of Health Sciences, 1(3), 190-201. https://izlik.org/JA56EY24YM
AMA
1. Koçak M, Kibaroglu S, Koruk M, et al. Comparison of the Performance of Large Language Models (LLMs) in Predicting International Classification of Diseases Codes (ICD-10) Using Turkish Neurology Doctor Reports. Northern Journal of Health Sciences. 2025;1(3):190-201. https://izlik.org/JA56EY24YM
Chicago
Koçak, Murat, Seda Kibaroglu, Meryem Koruk, et al. 2025. “Comparison of the Performance of Large Language Models (LLMs) in Predicting International Classification of Diseases Codes (ICD-10) Using Turkish Neurology Doctor Reports”. Northern Journal of Health Sciences 1 (3): 190-201. https://izlik.org/JA56EY24YM.
EndNote
Koçak M, Kibaroglu S, Koruk M, Akpinar MÇ, Ademoğulları H, Öksüz Mİ, Öztoklu YA, Turan SS (September 26, 2025) Comparison of the Performance of Large Language Models (LLMs) in Predicting International Classification of Diseases Codes (ICD-10) Using Turkish Neurology Doctor Reports. Northern Journal of Health Sciences 1 3 190–201.
IEEE
[1] M. Koçak et al., “Comparison of the Performance of Large Language Models (LLMs) in Predicting International Classification of Diseases Codes (ICD-10) Using Turkish Neurology Doctor Reports”, Northern Journal of Health Sciences, vol. 1, no. 3, pp. 190–201, Sept. 2025, [Online]. Available: https://izlik.org/JA56EY24YM
ISNAD
Koçak, Murat - Kibaroglu, Seda - Koruk, Meryem - Akpinar, Mehmet Çağlar - Ademoğulları, Hüseyin - Öksüz, Mehmet İbrahim - Öztoklu, Yasmin Ayşe - Turan, Sevin Suyla. “Comparison of the Performance of Large Language Models (LLMs) in Predicting International Classification of Diseases Codes (ICD-10) Using Turkish Neurology Doctor Reports”. Northern Journal of Health Sciences 1/3 (September 26, 2025): 190-201. https://izlik.org/JA56EY24YM.
JAMA
1. Koçak M, Kibaroglu S, Koruk M, Akpinar MÇ, Ademoğulları H, Öksüz Mİ, Öztoklu YA, Turan SS. Comparison of the Performance of Large Language Models (LLMs) in Predicting International Classification of Diseases Codes (ICD-10) Using Turkish Neurology Doctor Reports. Northern Journal of Health Sciences. 2025;1(3):190–201.
MLA
Koçak, Murat, et al. “Comparison of the Performance of Large Language Models (LLMs) in Predicting International Classification of Diseases Codes (ICD-10) Using Turkish Neurology Doctor Reports”. Northern Journal of Health Sciences, vol. 1, no. 3, Sept. 2025, pp. 190-201, https://izlik.org/JA56EY24YM.
Vancouver
1. Murat Koçak, Seda Kibaroglu, Meryem Koruk, Mehmet Çağlar Akpinar, Hüseyin Ademoğulları, Mehmet İbrahim Öksüz, Yasmin Ayşe Öztoklu, Sevin Suyla Turan. Comparison of the Performance of Large Language Models (LLMs) in Predicting International Classification of Diseases Codes (ICD-10) Using Turkish Neurology Doctor Reports. Northern Journal of Health Sciences [Internet]. 2025 Sep. 26;1(3):190-201. Available from: https://izlik.org/JA56EY24YM