Evaluating the Performance of Artificial Intelligence in MRI-Based LI-RADS v2018 Classification of Liver Lesions
Year 2025, Volume: 6, Issue: 2, 159-163, 28.07.2025
Ahmet Tanyeri, Rıdvan Akbulut, Aygün Katmerlikaya, Aral Varol, Göksel Tuzcu, Tuna Şahin
Abstract
Introduction: This study aimed to evaluate the diagnostic performance of GPT-4V, a vision-enabled large language model, in classifying liver lesions on MRI according to LI-RADS v2018 criteria.
Methods: Seventy contrast-enhanced liver MRI examinations were retrospectively selected, comprising 10 cases from each LI-RADS category. Each case was presented to GPT-4V as a standardised set of seven anonymised axial MRI slices, accompanied only by lesion size. The model was prompted to assign a single LI-RADS category based solely on visual input. The model's performance was assessed using overall accuracy, Cohen's kappa, ROC analysis, and Spearman correlation between lesion size and classification correctness.
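As an illustration of how such a case could be presented to a vision-enabled model, the sketch below assembles seven base64-encoded slices and the lesion size into a single prompt via the OpenAI Python SDK; the model name, prompt wording, and file layout are assumptions for illustration and do not reproduce the study's exact protocol.

```python
# Illustrative sketch only: model name, prompt text, and file paths are
# assumptions; the study's exact prompt and interface are not reproduced here.
import base64
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "You are given seven anonymised axial liver MRI slices of one lesion "
    "(size: {size_mm} mm). Assign exactly one LI-RADS v2018 category "
    "(LR-1, LR-2, LR-3, LR-4, LR-5, LR-M or LR-TIV). Answer with the category only."
)

def encode_image(path: Path) -> dict:
    """Return one image content part as a base64 data URL."""
    data = base64.b64encode(path.read_bytes()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{data}"}}

def classify_case(slice_paths: list[Path], size_mm: int) -> str:
    """Send the slices plus lesion size and return the model's category string."""
    content = [{"type": "text", "text": PROMPT.format(size_mm=size_mm)}]
    content += [encode_image(p) for p in slice_paths]
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # hypothetical choice of vision-enabled GPT-4 model
        messages=[{"role": "user", "content": content}],
        max_tokens=10,
    )
    return response.choices[0].message.content.strip()

# Example call: seven slices of one anonymised case plus its lesion size
# category = classify_case(sorted(Path("case_01").glob("slice_*.png")), size_mm=24)
```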
Results: GPT-4V achieved an overall classification accuracy of 37.1%. While the accuracy for LR-5 was high (90%), it was notably poor in LR-3 (0%) and LR-4 (20%). More than half of the lesions were misclassified as LR-5 (54.2%). Binary classification into benign (LR-1 and LR-2) versus malignant (LR-4, LR-5, LR-M, LR-TIV) yielded an accuracy of 84.3%, with an AUC of 0.72, sensitivity of 100%, and specificity of 45%. Cohen’s kappa values were 0.27 for detailed classification and 0.54 for benign–malignant grouping. Lesion size positively correlated with classification accuracy (ρ = 0.26, p = 0.031). The model demonstrated a tendency to favour high-certainty categories, often defaulting to LR-5 when diagnostic ambiguity was present.
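A minimal sketch of how these metrics could be computed from per-case predictions is shown below, assuming scikit-learn and SciPy and hypothetical inputs (true category, predicted category, and lesion size per case); the handling of LR-3 in the binary grouping is likewise an assumption rather than the study's definition.

```python
# Sketch of the reported metrics, assuming per-case lists of true LI-RADS
# categories, predicted categories, and lesion sizes in millimetres.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, cohen_kappa_score, recall_score, roc_auc_score

BENIGN = {"LR-1", "LR-2"}
MALIGNANT = {"LR-4", "LR-5", "LR-M", "LR-TIV"}  # LR-3 excluded here by assumption

def evaluate(y_true, y_pred, lesion_size_mm):
    # Detailed 7-category agreement
    acc = accuracy_score(y_true, y_pred)
    kappa = cohen_kappa_score(y_true, y_pred)

    # Binary benign-versus-malignant grouping
    keep = [i for i, t in enumerate(y_true) if t in BENIGN or t in MALIGNANT]
    t_bin = np.array([y_true[i] in MALIGNANT for i in keep], dtype=int)
    p_bin = np.array([y_pred[i] in MALIGNANT for i in keep], dtype=int)
    auc = roc_auc_score(t_bin, p_bin)                # AUC from the binary calls
    sens = recall_score(t_bin, p_bin)                # sensitivity for malignancy
    spec = recall_score(t_bin, p_bin, pos_label=0)   # specificity
    kappa_bin = cohen_kappa_score(t_bin, p_bin)

    # Spearman correlation between lesion size and per-case correctness
    correct = np.array([t == p for t, p in zip(y_true, y_pred)], dtype=int)
    rho, p_value = spearmanr(lesion_size_mm, correct)

    return {"accuracy": acc, "kappa": kappa, "auc": auc, "sensitivity": sens,
            "specificity": spec, "kappa_binary": kappa_bin, "rho": rho, "p": p_value}
```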
Conclusions: GPT-4V demonstrated limited performance in detailed LI-RADS classification, with a strong bias toward LR-5 and poor accuracy in intermediate categories. Although it performed comparatively better in distinguishing benign from malignant lesions, in its current form the model remains inadequate for precise image-based categorisation. Further development with structured visual training and clinically contextual prompting is warranted.
Ethical Statement
This study was conducted in accordance with the principles of the Declaration of Helsinki and was approved by the Non-Interventional Clinical Research Ethics Committee of Adnan Menderes University Faculty of Medicine (Date: 25.03.2025; Approval No: 709857).