Research Article

Evaluating the Performance of Artificial Intelligence in MRI-Based LI-RADS v2018 Classification of Liver Lesions

Year 2025, Volume: 6 Issue: 2, 159 - 163, 28.07.2025

Abstract

Introduction: This study aimed to evaluate the diagnostic performance of GPT-4V, a vision-enabled large language model, in classifying liver lesions on MRI according to LI-RADS v2018 criteria.

Methods: Seventy contrast-enhanced liver MRI examinations were retrospectively selected, comprising 10 cases from each LI-RADS category. Each case was presented to GPT-4V as a standardised set of seven anonymised axial MRI slices, accompanied only by lesion size. The model was prompted to assign a single LI-RADS category based solely on visual input. The model’s performance was assessed using overall accuracy, Cohen’s kappa, ROC analysis, and correlation with lesion size.
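The agreement metrics named in the Methods can be illustrated with a minimal Python sketch. It assumes the reference and model-assigned categories are held in two parallel lists; the function names and the seven toy labels are hypothetical and are not the study's data or code.

```python
from collections import Counter


def overall_accuracy(reference, predicted):
    """Fraction of cases where the predicted category equals the reference."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


def cohens_kappa(reference, predicted):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(reference)
    p_o = overall_accuracy(reference, predicted)  # observed agreement
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Agreement expected by chance, from the marginal category frequencies.
    labels = set(ref_counts) | set(pred_counts)
    p_e = sum(ref_counts[c] * pred_counts[c] for c in labels) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when only one category is used
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels (one case per category), NOT the study data.
reference = ["LR-1", "LR-2", "LR-3", "LR-4", "LR-5", "LR-M", "LR-TIV"]
predicted = ["LR-1", "LR-5", "LR-5", "LR-5", "LR-5", "LR-M", "LR-5"]

print(overall_accuracy(reference, predicted))
print(cohens_kappa(reference, predicted))
```

For these toy lists the accuracy is 3/7 and kappa is 1/3: chance agreement is 1/7 because the predictions concentrate on a single category, which is the kind of pattern kappa penalises relative to raw accuracy.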

Results: GPT-4V achieved an overall classification accuracy of 37.1%. Accuracy was high for LR-5 (90%) but notably poor for LR-3 (0%) and LR-4 (20%). More than half of the lesions were misclassified as LR-5 (54.2%). Binary classification into benign (LR-1 and LR-2) versus malignant (LR-4, LR-5, LR-M, LR-TIV) yielded an accuracy of 84.3%, with an AUC of 0.72, sensitivity of 100%, and specificity of 45%. Cohen’s kappa values were 0.27 for detailed classification and 0.54 for benign–malignant grouping. Lesion size correlated positively with classification accuracy (ρ = 0.26, p = 0.031). The model tended to favour high-certainty categories, often defaulting to LR-5 when diagnostic ambiguity was present.
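The binary metrics reported above follow from a 2×2 table of benign versus malignant calls. The sketch below, with hypothetical function names and made-up counts, shows the standard definitions. It also uses the identity that a classifier producing one hard label per case has a single ROC operating point, so the trapezoidal AUC equals the mean of sensitivity and specificity. The source does not state how its AUC was computed, so applying that identity to the reported values is an assumption.

```python
def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from 2x2 counts (malignant = positive)."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


def single_point_auc(sensitivity, specificity):
    """Trapezoidal area under an ROC curve that has one operating point.

    The curve runs (0, 0) -> (1 - specificity, sensitivity) -> (1, 1),
    and its area simplifies to (sensitivity + specificity) / 2.
    """
    return (sensitivity + specificity) / 2


# Made-up counts for illustration only, NOT taken from the study.
toy = binary_metrics(tp=8, fn=2, tn=3, fp=7)
print(toy)

# Sensitivity and specificity as reported in the abstract.
print(round(single_point_auc(1.00, 0.45), 3))
```

With the abstract's sensitivity of 100% and specificity of 45%, the identity gives about 0.725, which is in line with the reported AUC of 0.72.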

Conclusions: GPT-4V showed limited performance in detailed LI-RADS classification, with a strong bias toward LR-5 and poor accuracy in intermediate categories. Although the model performed relatively better in distinguishing benign from malignant lesions, it remains inadequate in its current form for precise image-based categorisation. Further development with structured visual training and clinically contextual prompting is warranted.

Ethical Statement

This study was conducted in accordance with the principles of the Declaration of Helsinki and was approved by the Non-Interventional Clinical Research Ethics Committee of Adnan Menderes University Faculty of Medicine (Date: 25.03.2025; Approval No: 709857).

References

  • Elsayes KM, Hooker JC, Agrons MM, et al. 2017 Version of LI-RADS for CT and MR Imaging: An Update. Radiographics. 2017;37:1994-2017.
  • Chernyak V, Fowler KJ, Kamaya A, et al. Liver Imaging Reporting and Data System (LI-RADS) Version 2018: Imaging of Hepatocellular Carcinoma in At-Risk Patients. Radiology. 2018;289:816-30.
  • Chernyak V, Tang A, Do RKG, et al. Liver imaging: it is time to adopt standardized terminology. Eur Radiol. 2022;32:6291-301.
  • Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25:44-56.
  • Brin D, Sorin V, Barash Y, et al. Assessing GPT-4 multimodal performance in radiological image analysis. Eur Radiol. 2025;35:1959-65.
  • Horiuchi D, Tatekawa H, Oura T, et al. ChatGPT's diagnostic performance based on textual vs. visual information compared to radiologists' diagnostic performance in musculoskeletal radiology. Eur Radiol. 2025;35:506-16.
  • Mitsuyama Y, Tatekawa H, Takita H, et al. Comparative analysis of GPT-4-based ChatGPT's diagnostic performance with radiologists using real-world radiology reports of brain tumors. Eur Radiol. 2025;35:1938-47.
  • Reith TP, D'Alessandro DM, D'Alessandro MP. Capability of multimodal large language models to interpret pediatric radiological images. Pediatr Radiol. 2024;54:1729-37.
  • Huppertz MS, Siepmann R, Topp D, et al. Revolution or risk?-Assessing the potential and challenges of GPT-4V in radiologic image interpretation. Eur Radiol. 2025;35:1111-21.
  • Yang Z, Li L, Lin K, et al. The dawn of LMMs: preliminary explorations with GPT-4V(ision). arXiv 2023 [E-pub ahead of print], doi.org/10.48550/arXiv.2309.17421.
  • Hirosawa T, Harada Y, Tokumasu K, Ito T, Suzuki T, Shimizu T. Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data Integration. JMIR Med Inform. 2024;12:55627.
  • Gu K, Lee JH, Shin J, et al. Using GPT-4 for LI-RADS feature extraction and categorization with multilingual free-text reports. Liver Int. 2024;44:1578-87.
  • Fervers P, Hahnfeldt R, Kottlors J, et al. ChatGPT yields low accuracy in determining LI-RADS scores based on free-text and structured radiology reports in German language. Front Radiol. 2024;4:1390774.
  • Matute-González M, Darnell A, Comas-Cufí M, et al. Utilizing a domain-specific large language model for LI-RADS v2018 categorization of free-text MRI reports: a feasibility study. Insights Imaging. 2024;15:280.
  • Huang J, Yang R, Huang X, et al. Feasibility of large language models for CEUS LI-RADS categorization of small liver nodules in high-risk patients. Front Oncol. 2024;14:1513608.

Evaluation of Artificial Intelligence Performance in MRI-Based LI-RADS v2018 Classification of Liver Lesions


Abstract

Introduction: The aim of this study was to evaluate the diagnostic performance of GPT-4V, a large language model capable of analysing visual data, in classifying liver lesions on MRI according to LI-RADS v2018 criteria.

Methods: A total of 70 contrast-enhanced liver MRI examinations, 10 from each LI-RADS category, were retrospectively selected. Each case was presented to GPT-4V as a standardised image set of seven anonymised axial MRI slices, accompanied only by the lesion diameter. The model was asked to assign a single LI-RADS category based solely on visual information. Model performance was evaluated using accuracy, Cohen’s kappa coefficient, ROC analysis, and correlation with lesion size.

Results: The overall classification accuracy of GPT-4V was 37.1%. Accuracy was high in the LR-5 category (90%) but very low in the LR-3 (0%) and LR-4 (20%) categories. Of the lesions, 54.2% were incorrectly classified as LR-5. In distinguishing benign (LR-1 and LR-2) from malignant (LR-4, LR-5, LR-M, LR-TIV) lesions, the model’s accuracy was 84.3%, the AUC was 0.72, sensitivity was 100%, and specificity was 45%. Cohen’s kappa was 0.27 for detailed classification and 0.54 for benign–malignant grouping. A positive correlation was observed between lesion size and classification accuracy (ρ = 0.26, p = 0.031). In cases of diagnostic uncertainty, the model frequently tended toward the LR-5 category.

Conclusions: GPT-4V showed limited performance in detailed LI-RADS classification, with a marked bias toward the LR-5 category and low accuracy in the intermediate categories. Although it was somewhat more successful in distinguishing benign from malignant lesions, the model in its current form is inadequate for precise image-based classification. It needs further development with more structured visual training data and prompts that include clinical context.

There are 15 citations in total.

Details

Primary Language: English
Subjects: Radiology and Organ Imaging
Journal Section: Research Articles
Authors

Ahmet Tanyeri 0000-0002-1097-1172

Rıdvan Akbulut 0009-0004-8091-6322

Aygün Katmerlikaya 0009-0004-7868-9490

Aral Varol 0009-0009-4231-6824

Göksel Tuzcu 0000-0002-3957-1770

Tuna Şahin 0000-0002-5366-1510

Publication Date: July 28, 2025
Submission Date: April 15, 2025
Acceptance Date: July 17, 2025
Published in Issue: Year 2025, Volume: 6, Issue: 2

Cite

APA Tanyeri, A., Akbulut, R., Katmerlikaya, A., … Varol, A. (2025). Evaluating the Performance of Artificial Intelligence in MRI-Based LI-RADS v2018 Classification of Liver Lesions. Eskisehir Medical Journal, 6(2), 159-163.
AMA Tanyeri A, Akbulut R, Katmerlikaya A, Varol A, Tuzcu G, Şahin T. Evaluating the Performance of Artificial Intelligence in MRI-Based LI-RADS v2018 Classification of Liver Lesions. Eskisehir Med J. July 2025;6(2):159-163.
Chicago Tanyeri, Ahmet, Rıdvan Akbulut, Aygün Katmerlikaya, Aral Varol, Göksel Tuzcu, and Tuna Şahin. “Evaluating the Performance of Artificial Intelligence in MRI-Based LI-RADS v2018 Classification of Liver Lesions”. Eskisehir Medical Journal 6, no. 2 (July 2025): 159-63.
EndNote Tanyeri A, Akbulut R, Katmerlikaya A, Varol A, Tuzcu G, Şahin T (July 1, 2025) Evaluating the Performance of Artificial Intelligence in MRI-Based LI-RADS v2018 Classification of Liver Lesions. Eskisehir Medical Journal 6 2 159–163.
IEEE A. Tanyeri, R. Akbulut, A. Katmerlikaya, A. Varol, G. Tuzcu, and T. Şahin, “Evaluating the Performance of Artificial Intelligence in MRI-Based LI-RADS v2018 Classification of Liver Lesions”, Eskisehir Med J, vol. 6, no. 2, pp. 159–163, 2025.
ISNAD Tanyeri, Ahmet et al. “Evaluating the Performance of Artificial Intelligence in MRI-Based LI-RADS v2018 Classification of Liver Lesions”. Eskisehir Medical Journal 6/2 (July 2025), 159-163.
JAMA Tanyeri A, Akbulut R, Katmerlikaya A, Varol A, Tuzcu G, Şahin T. Evaluating the Performance of Artificial Intelligence in MRI-Based LI-RADS v2018 Classification of Liver Lesions. Eskisehir Med J. 2025;6:159–163.
MLA Tanyeri, Ahmet et al. “Evaluating the Performance of Artificial Intelligence in MRI-Based LI-RADS v2018 Classification of Liver Lesions”. Eskisehir Medical Journal, vol. 6, no. 2, 2025, pp. 159-63.
Vancouver Tanyeri A, Akbulut R, Katmerlikaya A, Varol A, Tuzcu G, Şahin T. Evaluating the Performance of Artificial Intelligence in MRI-Based LI-RADS v2018 Classification of Liver Lesions. Eskisehir Med J. 2025;6(2):159-63.