Diagnostic performance of artificial intelligence-based large language models in acute ischemic stroke detection

Ali Şalbaş; Murat Yoğurtçu; Özge Ertem; Fazıl Gelal

doi:10.52309/jaihs.1880689

Abstract

Purpose: This study evaluated and compared the performance of two large language models (ChatGPT 5.2 and Claude 4.5 Opus) in analyzing brain diffusion magnetic resonance imaging (MRI) for the detection of acute ischemic stroke.

Materials and Methods: This single-center retrospective study included 58 patients with acute middle cerebral artery territory infarction and 62 control patients with normal diffusion MRI findings. For each patient, one diffusion-weighted imaging (DWI) slice and the corresponding apparent diffusion coefficient (ADC) slice were selected. Images were presented to both models using a standardized prompt. The models were asked to identify the MRI sequence type, determine whether diffusion restriction was present, and classify the affected vascular territory. The performance of the two models was then compared.

Results: For sequence identification, ChatGPT achieved 97.5-100% accuracy and Claude achieved 93.3-99.2%. The most common error in both models was misclassification of DWI images as FLAIR. For detection of diffusion restriction, ChatGPT achieved 95.8% overall accuracy (98.3% sensitivity, 93.5% specificity), while Claude achieved 87.5% overall accuracy (94.8% sensitivity, 80.6% specificity). The difference between the two models was statistically significant (p=0.041). For vascular territory classification, ChatGPT achieved 93.1% accuracy and Claude achieved 62.1%.

Conclusion: Both models performed well in MRI sequence identification. ChatGPT outperformed Claude in detecting diffusion restriction and in classifying the vascular territory. These findings suggest that large language models have potential as supportive tools in acute stroke imaging; however, they are not yet sufficiently reliable for independent clinical use.
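The relationship between the reported sensitivity, specificity, and overall accuracy can be illustrated with a short sketch. The abstract reports only percentages; the confusion-matrix counts below are assumptions, back-calculated from those percentages and the stated group sizes (58 infarct patients, 62 controls), and are not taken from the article itself. The class and helper names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 confusion matrix for a binary finding (diffusion restriction)."""
    tp: int  # infarct cases flagged as restricted
    fn: int  # infarct cases missed
    tn: int  # controls correctly called normal
    fp: int  # controls wrongly flagged as restricted

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fn + self.tn + self.fp
        return (self.tp + self.tn) / total


def pct(x: float) -> float:
    """Express a proportion as a percentage rounded to one decimal place."""
    return round(100 * x, 1)


# ASSUMED counts: inferred from the reported percentages, not published data.
chatgpt = Confusion(tp=57, fn=1, tn=58, fp=4)
claude = Confusion(tp=55, fn=3, tn=50, fp=12)

for name, cm in (("ChatGPT", chatgpt), ("Claude", claude)):
    print(name, pct(cm.sensitivity), pct(cm.specificity), pct(cm.accuracy))
```

Under these assumed counts, the computed values match the percentages given in the Results. The reported p-value is not reproduced here because the abstract does not state which test was used or give the paired per-patient outcomes it would require.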

Supporting Institution

None.

Ethical Statement

This study was conducted in accordance with the Declaration of Helsinki. Ethical approval was obtained from the İzmir Katip Çelebi University Health Research Ethics Committee (decision number: 0669, date: 06.11.2025). Due to the retrospective design of the study, the requirement for informed consent was waived.

Acknowledgments

None.

Details

Primary Language

English

Subjects

Radiology and Organ Imaging

Journal Section

Research Article

Authors

Ali Şalbaş*
0000-0002-6157-6367
Türkiye

Murat Yoğurtçu
0000-0002-2190-5111
Türkiye

Özge Ertem
0000-0002-9599-3361
Türkiye

Fazıl Gelal
0000-0003-1263-0918
Türkiye

Publication Date

April 27, 2026

Submission Date

February 4, 2026

Acceptance Date

March 12, 2026

Published in Issue

Year: 2026, Volume: 6, Number: 1

DOI

https://doi.org/10.52309/jaihs.1880689

IZ

https://izlik.org/JA68JD66PZ

Cite

APA

Şalbaş, A., Yoğurtçu, M., Ertem, Ö., & Gelal, F. (2026). Diagnostic performance of artificial intelligence-based large language models in acute ischemic stroke detection. Sağlık Bilimlerinde Yapay Zeka Dergisi, 6(1), 13-21. https://doi.org/10.52309/jaihs.1880689

AMA

1. Şalbaş A, Yoğurtçu M, Ertem Ö, Gelal F. Diagnostic performance of artificial intelligence-based large language models in acute ischemic stroke detection. JAIHS. 2026;6(1):13-21. doi:10.52309/jaihs.1880689

Chicago

Şalbaş, Ali, Murat Yoğurtçu, Özge Ertem, and Fazıl Gelal. 2026. “Diagnostic Performance of Artificial Intelligence-Based Large Language Models in Acute Ischemic Stroke Detection”. Sağlık Bilimlerinde Yapay Zeka Dergisi 6 (1): 13-21. https://doi.org/10.52309/jaihs.1880689.

EndNote

Şalbaş A, Yoğurtçu M, Ertem Ö, Gelal F (April 1, 2026) Diagnostic performance of artificial intelligence-based large language models in acute ischemic stroke detection. Sağlık Bilimlerinde Yapay Zeka Dergisi 6 1 13–21.

IEEE

[1] A. Şalbaş, M. Yoğurtçu, Ö. Ertem, and F. Gelal, “Diagnostic performance of artificial intelligence-based large language models in acute ischemic stroke detection”, JAIHS, vol. 6, no. 1, pp. 13–21, Apr. 2026, doi: 10.52309/jaihs.1880689.

ISNAD

Şalbaş, Ali - Yoğurtçu, Murat - Ertem, Özge - Gelal, Fazıl. “Diagnostic Performance of Artificial Intelligence-Based Large Language Models in Acute Ischemic Stroke Detection”. Sağlık Bilimlerinde Yapay Zeka Dergisi 6/1 (April 1, 2026): 13-21. https://doi.org/10.52309/jaihs.1880689.

JAMA

1. Şalbaş A, Yoğurtçu M, Ertem Ö, Gelal F. Diagnostic performance of artificial intelligence-based large language models in acute ischemic stroke detection. JAIHS. 2026;6:13–21.

MLA

Şalbaş, Ali, et al. “Diagnostic Performance of Artificial Intelligence-Based Large Language Models in Acute Ischemic Stroke Detection”. Sağlık Bilimlerinde Yapay Zeka Dergisi, vol. 6, no. 1, Apr. 2026, pp. 13-21, doi:10.52309/jaihs.1880689.

Vancouver

1. Ali Şalbaş, Murat Yoğurtçu, Özge Ertem, Fazıl Gelal. Diagnostic performance of artificial intelligence-based large language models in acute ischemic stroke detection. JAIHS. 2026 Apr. 1;6(1):13-21. doi:10.52309/jaihs.1880689