Evaluating the Performance of Large Language Models in Generating Impressions for Radiology Reports

Hasan Emin Kaya; Dilek Sağlam; Zeynep Yazıcı; Gökhan Gökalp

doi:10.32708/uutfd.1653680

Abstract

This study evaluated and compared the performance of three popular large language models (LLMs) in generating impressions for radiology reports written in Turkish. ChatGPT, Gemini, and Copilot were each given a “few-shot” prompt and asked to generate impressions for 50 anonymized radiology reports. Three radiologists scored each impression on a Likert scale against four criteria: whether it included all relevant information from the report, provided an appropriate summary of the report, contained no misleading information, and could be added to the report without modification. Friedman's test was used to assess whether the scores differed among the LLMs. The 50 reports comprised 32 magnetic resonance, 11 computed tomography, 5 ultrasound, and 2 fluoroscopy examinations. By subspecialty, 15 were neuroradiology, 14 musculoskeletal, 13 abdominal, and 8 thoracic radiology studies. In the three radiologists' ratings, the models’ median scores were 4 and 5, indicating that the radiologists generally found the models successful in generating impressions. No statistically significant difference was found among the models for including all information, providing an appropriate summary, avoiding misleading information, or being suitable for inclusion in the report without modification (p = 0.607, 0.327, 0.629, and 0.089, respectively). In conclusion, ChatGPT, Gemini, and Copilot were found to be successful in generating impressions for radiology reports in Turkish, and no significant difference in performance was detected among the models.
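As an illustration of the statistical comparison described in the abstract, the following is a minimal sketch of Friedman's test for three models rated on the same reports. It is not the authors' analysis code: the function names and the scores are hypothetical, and the abstract does not state how the three radiologists' ratings were combined before testing. The p-value uses the usual chi-square approximation, which for three models (2 degrees of freedom) reduces to exp(-Q/2).

```python
import math

def average_ranks(row):
    """Rank the scores within one report (1 = lowest); ties share their mean rank."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = mean_rank
        i = j + 1
    return ranks

def friedman_statistic(scores):
    """Tie-corrected Friedman statistic Q for an n-reports by k-models score table."""
    n, k = len(scores), len(scores[0])
    ranked = [average_ranks(row) for row in scores]
    rank_sums = [sum(r[j] for r in ranked) for j in range(k)]
    q = 12 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3 * n * (k + 1)
    # Tie correction: sum (t^3 - t) over every group of t tied scores within a report.
    tie_term = sum(row.count(v) ** 3 - row.count(v) for row in scores for v in set(row))
    correction = 1 - tie_term / (n * k * (k * k - 1))
    if correction == 0:
        raise ValueError("all scores are tied within every report")
    return q / correction

def p_value_three_models(q):
    """Chi-square survival function with 2 degrees of freedom (valid for k = 3 only)."""
    return math.exp(-q / 2)

# Hypothetical Likert scores (NOT data from the study): one row per report,
# columns are the three models in a fixed order.
hypothetical_scores = [
    [5, 4, 4],
    [4, 4, 5],
    [5, 5, 4],
    [4, 5, 5],
    [5, 4, 5],
    [4, 4, 4],
]

q = friedman_statistic(hypothetical_scores)
print(f"Q = {q:.3f}, p = {p_value_three_models(q):.3f}")
```

Because Likert scores from several models on the same report are often tied, the tie correction matters here; where SciPy is available, `scipy.stats.friedmanchisquare` performs the same test and could be used instead.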

Details

Primary Language

English

Subjects

Radiology and Organ Imaging

Section

Research Article

Authors

Hasan Emin Kaya *
0000-0002-7411-4102
Türkiye

Dilek Sağlam
0000-0002-5778-6847
Türkiye

Zeynep Yazıcı
0000-0002-8647-5298
Türkiye

Gökhan Gökalp
0000-0002-3682-2474
Türkiye

Publication Date

August 28, 2025

Submission Date

March 10, 2025

Acceptance Date

July 31, 2025

Published in Issue

Year 2025, Volume 51, Issue 2

DOI

https://doi.org/10.32708/uutfd.1653680

IZ

https://izlik.org/JA55MJ25KC

Cite

APA

Kaya, H. E., Sağlam, D., Yazıcı, Z., & Gökalp, G. (2025). Evaluating the Performance of Large Language Models in Generating Impressions for Radiology Reports. Journal of Uludağ University Medical Faculty, 51(2), 305-309. https://doi.org/10.32708/uutfd.1653680

AMA

1. Kaya HE, Sağlam D, Yazıcı Z, Gökalp G. Evaluating the Performance of Large Language Models in Generating Impressions for Radiology Reports. Uludağ Tıp Derg. 2025;51(2):305-309. doi:10.32708/uutfd.1653680

Chicago

Kaya, Hasan Emin, Dilek Sağlam, Zeynep Yazıcı, and Gökhan Gökalp. 2025. “Evaluating the Performance of Large Language Models in Generating Impressions for Radiology Reports”. Journal of Uludağ University Medical Faculty 51 (2): 305-9. https://doi.org/10.32708/uutfd.1653680.

EndNote

Kaya HE, Sağlam D, Yazıcı Z, Gökalp G (August 1, 2025) Evaluating the Performance of Large Language Models in Generating Impressions for Radiology Reports. Journal of Uludağ University Medical Faculty 51 2 305–309.

IEEE

[1] H. E. Kaya, D. Sağlam, Z. Yazıcı, and G. Gökalp, “Evaluating the Performance of Large Language Models in Generating Impressions for Radiology Reports”, Uludağ Tıp Derg, vol. 51, no. 2, pp. 305–309, Aug. 2025, doi: 10.32708/uutfd.1653680.

ISNAD

Kaya, Hasan Emin - Sağlam, Dilek - Yazıcı, Zeynep - Gökalp, Gökhan. “Evaluating the Performance of Large Language Models in Generating Impressions for Radiology Reports”. Journal of Uludağ University Medical Faculty 51/2 (August 1, 2025): 305-309. https://doi.org/10.32708/uutfd.1653680.

JAMA

1. Kaya HE, Sağlam D, Yazıcı Z, Gökalp G. Evaluating the Performance of Large Language Models in Generating Impressions for Radiology Reports. Uludağ Tıp Derg. 2025;51:305–309.

MLA

Kaya, Hasan Emin, et al. “Evaluating the Performance of Large Language Models in Generating Impressions for Radiology Reports”. Journal of Uludağ University Medical Faculty, vol. 51, no. 2, August 2025, pp. 305-9, doi:10.32708/uutfd.1653680.

Vancouver

1. Hasan Emin Kaya, Dilek Sağlam, Zeynep Yazıcı, Gökhan Gökalp. Evaluating the Performance of Large Language Models in Generating Impressions for Radiology Reports. Uludağ Tıp Derg. August 1, 2025;51(2):305-9. doi:10.32708/uutfd.1653680