Comparison of the Performance of Large Language Models (LLMs) in Predicting International Classification of Diseases Codes (ICD-10) Using Turkish Neurology Doctor Reports

Murat Koçak; Seda Kibaroglu; Meryem Koruk; Mehmet Çağlar Akpinar; Hüseyin Ademoğulları; Mehmet İbrahim Öksüz; Yasmin Ayşe Öztoklu; Sevin Suyla Turan

Research Article

Comparison of the Performance of Large Language Models (LLMs) in Predicting International Classification of Diseases (ICD-10) Codes from Turkish Neurology Doctor Reports

Year 2025, Volume: 1 Issue: 3, 190 - 201, 26.09.2025

Abstract

Objective: The accurate and efficient use of International Classification of Diseases (ICD-10) codes in neurology is vital for healthcare reimbursement, research, and patient health surveillance. However, manually extracting these codes from physician reports is both time-consuming and error-prone. This study evaluates the performance of several large language models (LLMs) in automatically predicting ICD-10 diagnosis codes, specifically from Turkish neurology doctor reports.
Method: The performance of ten LLMs (ChatGPT, Cohere Coral, Claude, DeepSeek, Qwen, Groq, Gemini, Meta Llama, Mistral, and Perplexity) was evaluated on a dataset of 51 de-identified neurology doctor reports. A standardized prompt was used to instruct each LLM to extract the ICD-10 codes relevant to the diagnoses stated in the reports. The codes generated by the LLMs were then compared with a gold-standard code set assigned by neurology specialists. Accuracy, precision, recall, and F1-score were used as performance metrics to assess the models' effectiveness.
Results: Among the systems evaluated, ChatGPT was the top performer, with 68.6% accuracy and an F1-score of 0.812, showing strong precision (0.686) and perfect recall (1.0). It excelled at identifying common neurological conditions such as migraine, transient ischemic attack (TIA; G45.9), and motor neuron disorders. Gemini followed with 58.8% accuracy (F1-score: 0.750), while Qwen and Claude showed moderate performance (54.9% and 49.0% accuracy, respectively). In contrast, Groq and Meta Llama performed worse, with accuracies of 25.5% and 27.5%, respectively. These lower-performing models struggled particularly with complex or rare cases such as encephalitis (G04.8) and carpal tunnel syndrome (G56.0), highlighting the need for better training on nuanced neurological conditions.
Conclusion: While LLMs show promise for automating ICD-10 coding of neurology reports, their performance varies considerably. High-performing models such as ChatGPT, Gemini, and Qwen show strong potential, but further refinement is needed to improve the accuracy and reliability of lower-performing systems. Future research should focus on improving training datasets, incorporating rule-based algorithms, and integrating human oversight, particularly to address inconsistencies in complex or rare neurological cases.

Keywords

Artificial Intelligence, Neurology Reports, Large Language Models (LLM), Natural Language Processing, ICD-10 Coding, Medical Informatics

Project Number

Project no:KA25/180

Abstract

Objective: Accurate and efficient use of International Classification of Diseases (ICD-10) codes in neurology is vital for healthcare reimbursement, research, and patient health surveillance. However, manually extracting these codes from physician reports is both time-consuming and error-prone. This study evaluates the performance of several large language models (LLMs) in automatically predicting ICD-10 diagnosis codes, specifically from Turkish neurology physician reports.

Method: The study evaluates the performance of ten LLMs (ChatGPT, Cohere Coral, Claude, DeepSeek, Qwen, Groq, Gemini, Meta Llama, Mistral, and Perplexity) on a dataset of 51 de-identified neurology doctor reports. A standardized prompt was used to instruct each LLM to extract the ICD-10 codes relevant to the diagnoses documented in the reports. The LLM-generated codes were then compared with a gold-standard set of codes assigned by certified neurology coding specialists. Accuracy, precision, recall, and F1-score were used to assess the models' effectiveness.
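The comparison against the gold standard can be illustrated with a minimal Python sketch. The counting rules here are an assumption, not the study's documented procedure: each report is taken to have exactly one gold code and at most one predicted code, a wrong code counts as a false positive, and a missing answer counts as a false negative. The function name `score_predictions` and the toy data are hypothetical.

```python
from typing import Optional, Sequence


def score_predictions(gold: Sequence[str], predicted: Sequence[Optional[str]]) -> dict:
    """Score one predicted ICD-10 code per report against one gold code per report."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one report is required")

    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p is None:
            fn += 1  # assumed rule: no code returned for this report
        elif p.strip().upper() == g.strip().upper():
            tp += 1  # exact match with the gold code
        else:
            fp += 1  # assumed rule: a wrong code was returned

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": tp / len(gold), "precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for illustration only (not taken from the study).
gold = ["G45.9", "G56.0", "G04.8", "G43.9"]
predicted = ["G45.9", "G56.0", "G40.9", None]
print(score_predictions(gold, predicted))
```

Under these assumed rules, a model that returns some code for every report has recall 1.0 and a precision equal to its accuracy.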

Results: Among the LLMs, ChatGPT was the top performer, with an accuracy of 68.6% and an F1-score of 0.812, reflecting strong precision (0.686) and perfect recall (1.0). It excelled at identifying common neurological conditions such as migraine, transient ischemic attack (TIA; G45.9), and motor neuron disorders. Gemini followed with 58.8% accuracy (F1-score: 0.750), while Qwen and Claude showed moderate performance (54.9% and 49.0% accuracy, respectively). In contrast, Groq and Meta Llama performed poorly, with accuracies of 25.5% and 27.5%, respectively.

Conclusion: While LLMs show promise for automating ICD-10 coding from neurology reports, there is considerable variability in their performance. High-performing models like ChatGPT demonstrate strong potential, but further refinement is needed to improve the accuracy and reliability of lower-performing systems. Future research should focus on enhancing training datasets, incorporating rule-based algorithms, and integrating human oversight to address discrepancies, particularly in complex or rare neurological cases.
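As one illustration of how rule-based checks and human oversight might be combined, the Python sketch below routes a model-suggested code either to acceptance or to manual review. It is a hypothetical policy, not something described in the study: the regular expression is a simplified approximation of the ICD-10 code shape, and the only domain rule is that codes outside Chapter VI (G00-G99, diseases of the nervous system) are sent to a human. The name `triage_code` is an assumption.

```python
import re

# Simplified shape of an ICD-10 code: one letter, two digits, optional decimal part.
ICD10_PATTERN = re.compile(r"^[A-Z]\d{2}(\.\d{1,2})?$")


def triage_code(code: str) -> str:
    """Return 'accept' or 'review' for a model-suggested code (hypothetical policy)."""
    normalized = code.strip().upper()
    if not ICD10_PATTERN.match(normalized):
        return "review"  # malformed output, e.g. free text instead of a code
    if not normalized.startswith("G"):
        return "review"  # outside Chapter VI (G00-G99); let a human confirm
    return "accept"


for suggestion in ["G56.0", "g04.8", "I63.9", "migraine"]:
    print(suggestion, "->", triage_code(suggestion))
```

Here "accept" only means the code passed the format and chapter checks; it does not establish that the code is clinically correct for the report.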

Keywords

Artificial Intelligence, Neurology Reports, Large Language Models, Natural Language Processing, ICD-10 Coding, Medical Informatics

Ethics Statement

This study was approved by the Başkent University Institutional Review Board (Project no: KA25/180) and supported by the Başkent University Research Fund.

Supporting Institution

Başkent University, Faculty of Medicine

Details

Primary Language	English
Subjects	Health Services and Systems (Other)
Section	Research Article
Authors	Murat Koçak (0000-0001-6510-3666); Seda Kibaroglu (0000-0002-3964-268X); Meryem Koruk; Mehmet Çağlar Akpinar; Hüseyin Ademoğulları; Mehmet İbrahim Öksüz; Yasmin Ayşe Öztoklu; Sevin Suyla Turan
Project Number	Project no:KA25/180
Publication Date	26 September 2025
Submission Date	20 July 2025
Acceptance Date	5 September 2025
Published in Issue	Year 2025 Volume: 1 Issue: 3

Cite

APA	Koçak, M., Kibaroglu, S., Koruk, M., … Akpinar, M. Ç. (2025). Comparison of the Performance of Large Language Models (LLMs) in Predicting International Classification of Diseases Codes (ICD-10) Using Turkish Neurology Doctor Reports. Northern Journal of Health Sciences, 1(3), 190-201.
