Comparative Performance Evaluation of Multimodal Large Language Models, a Radiologist, and an Anatomist on Visual Neuroanatomy Questions
Year 2024, Volume: 50, Issue: 3, 551-556, 12.01.2025
Yasin Celal Güneş, Mehmet Ülkir
Abstract
This study examined the performance of four multimodal Large Language Models (LLMs): GPT-4V, GPT-4o, LLaVA, and Gemini 1.5 Flash, on multiple-choice visual neuroanatomy questions, comparing them with a radiologist and an anatomist. The study employed a cross-sectional design and evaluated responses to 100 visual questions sourced from the Radiopaedia website. The accuracy of the responses was analyzed using the McNemar test. The radiologist demonstrated the highest performance, with an accuracy of 90%, while the anatomist achieved 67%. Among the multimodal LLMs, GPT-4o performed best with an accuracy of 45%, followed by Gemini 1.5 Flash at 35%, GPT-4V at 22%, and LLaVA at 15%. The radiologist significantly outperformed both the anatomist and all multimodal LLMs (p<0.001). GPT-4o significantly outperformed GPT-4V and LLaVA (p<0.001), but no significant difference was found between GPT-4o and Gemini 1.5 Flash (p=0.123). Gemini 1.5 Flash was significantly superior to LLaVA (p<0.001) and also differed significantly from GPT-4V (p=0.004). These findings highlight the substantial performance gap between multimodal LLMs and medical professionals: while multimodal LLMs hold great potential in the medical field, they have not yet reached the accuracy of medical experts in correctly identifying neuroanatomical regions.
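For readers who want to see how the reported comparisons work, the following is a minimal sketch of the McNemar test on paired correct/incorrect answers to the same 100 questions, written in Python with numpy and statsmodels. The correctness vectors below are simulated at the reported accuracy rates purely for illustration; they are not the study's data, and this is not the authors' actual analysis code.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
radiologist = rng.random(100) < 0.90  # True = correct answer (simulated, not study data)
gpt4o = rng.random(100) < 0.45

# 2x2 table of paired outcomes: rows = radiologist, columns = GPT-4o
table = [
    [np.sum(radiologist & gpt4o), np.sum(radiologist & ~gpt4o)],
    [np.sum(~radiologist & gpt4o), np.sum(~radiologist & ~gpt4o)],
]
result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"statistic={result.statistic}, p={result.pvalue:.4f}")

The test conditions only on the discordant pairs (questions that exactly one responder answered correctly), which is what makes it appropriate for comparing accuracies measured on the same question set rather than on independent samples.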
Ethics Statement
Ethics Committee Approval Information
This study does not require ethics committee approval, as it was conducted using publicly available internet data and the images do not contain patient information. The study was carried out in accordance with the Standards for Reporting of Diagnostic Accuracy (STARD) and the Checklist for Artificial Intelligence in Medical Imaging (CLAIM).
Acknowledgments
The authors used ChatGPT-4o (September 2024 release; OpenAI; https://chat.openai.com/) to review the grammar and the English translation. The content of the publication is entirely the responsibility of the authors, who reviewed and edited it as they deemed necessary.
The authors thank Juliette Hancox, Visual Licensing Manager at Radiopaedia.org, for granting permission to use the images from the associated website.
References
- 1. Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt JN, Laleh NG, Löffler CML, Schwarzkopf SC, Unger M, Veldhuizen GP, Wagner SJ, Kather JN (2023) The future landscape of large language models in medicine. Commun Med (Lond) 3:141. https://doi.org/10.1038/s43856-023-00370-1
- 2. GPT-4 is OpenAI’s most advanced system, producing safer and more useful responses. OpenAI. https://openai.com/gpt-4; GPT-4V(ision) System Card. OpenAI.
- 3. Liu H, Li C, Wu Q, Lee YJ (2024) Visual instruction tuning. Adv Neural Inf Process Syst 36
- 4. Gemini Flash. Google DeepMind. https://deepmind.google/technologies/gemini/flash/
- 5. Hello GPT-4o. OpenAI. https://openai.com/index/hello-gpt-4o/
- 6. Kuang Y-R, Zou M-X, Niu H-Q, Zheng B-Y, Zhang T-L, Zheng B-W (2023) ChatGPT encounters multiple opportunities and challenges in neurosurgery. Int J Surg 109:2886-2891. https://doi.org/10.1097/JS9.0000000000000571
- 7. Gunes YC, Camur E, Cesur T (2024) Correspondence on ‘Evaluation of ChatGPT in knowledge of newly evolving neurosurgery: middle meningeal artery embolization for subdural hematoma management’ by Koester et al. J Neurointerv Surg
- 8. Andykarayalar R, Surapaneni KM (2024) ChatGPT in Pediatrics: Unraveling Its Significance as a Clinical Decision Support Tool. Indian Pediatr 61:357-358
- 9. Dinis-Oliveira RJ, Azevedo RM (2023) ChatGPT in forensic sciences: a new Pandora’s box with advantages and challenges to pay attention. Forensic Sci Res 8:275-279. https://doi.org/10.1093/fsr/owad039
- 10. Elkassem AA, Smith AD (2023) Potential use cases for ChatGPT in radiology reporting. AJR Am J Roentgenol 221:373-376. https://doi.org/10.2214/AJR.23.29198
- 11. Ferreira AL, Chu B, Grant-Kels JM, Ogunleye T, Lipoff JB (2023) Evaluation of ChatGPT dermatology responses to common patient queries. JMIR Dermatol 6:e49280. https://doi.org/10.2196/49280
- 12. Ilgaz HB, Çelik Z (2023) The significance of artificial intelligence platforms in anatomy education: an experience with ChatGPT and Google Bard. Cureus 15:e45301. https://doi.org/10.7759/cureus.45301
- 13. Langlie J, Kamrava B, Pasick LJ, Mei C, Hoffer ME (2024) Artificial intelligence and ChatGPT: An otolaryngology patient's ally or foe? Am J Otolaryngol 45:104220. https://doi.org/10.1016/j.amjoto.2024.104220
- 14. Cirone K, Akrout M, Abid L, Oakley A (2024) Assessing the utility of multimodal large language models (GPT-4 Vision and Large Language and Vision Assistant) in identifying melanoma across different skin tones. JMIR Dermatol 7:e55508. https://doi.org/10.2196/55508
- 15. Deng J, Heybati K, Shammas-Toma M (2024) When vision meets reality: Exploring the clinical applicability of GPT-4 with vision. Clin Imaging 108:110101. https://doi.org/10.1016/j.clinimag.2024.110101
- 16. Nakao T, Miki S, Nakamura Y, Kikuchi T, Nomura Y, Hanaoka S, Yoshikawa T, Abe O (2024) Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study. JMIR Med Educ 10:e54393. https://doi.org/10.2196/54393
- 17. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, Lijmer JG, Moher D, Rennie D, De Vet HC (2015) STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. Radiology 277:826-832. https://doi.org/10.1136/bmj.h5527
- 18. Mongan J, Moy L, Kahn CE, Jr. (2020) Checklist for Artificial Intelligence in Medical Imaging (CLAIM): A Guide for Authors and Reviewers. Radiol Artif Intell 2:e200029. https://doi.org/10.1148/ryai.2020200029
- 19. Bolgova O, Shypilova I, Sankova L, Mavrych V (2023) How Well Did ChatGPT Perform in Answering Questions on Different Topics in Gross Anatomy? EJMED 5:94-100. https://doi.org/10.24018/ejmed.2023.5.6.1989
- 20. Lee H (2023) The rise of ChatGPT: Exploring its potential in medical education. Anat Sci Educ. https://doi.org/10.1002/ase.2270
- 21. Mogali SR (2024) Initial impressions of ChatGPT for anatomy education. Anat Sci Educ 17:444-447. https://doi.org/10.1002/ase.2261
- 22. Totlis T, Natsis K, Filos D, Ediaroglou V, Mantzou N, Duparc F, Piagkou M (2023) The potential role of ChatGPT and artificial intelligence in anatomy education: a conversation with ChatGPT. Surg Radiol Anat 45:1321-1329
- 23. Li C, Wong C, Zhang S, Usuyama N, Liu H, Yang J, Naumann T, Poon H, Gao J (2024) LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Adv Neural Inf Process Syst 36. https://doi.org/10.48550/arXiv.2306.00890
- 24. Monajatipoor M, Rouhsedaghat M, Li LH, Jay Kuo C-C, Chien A, Chang K-W (2022) BERTHop: An effective vision-and-language model for chest X-ray disease diagnosis. International Conference on Medical Image Computing and Computer-Assisted Intervention:725-734. https://doi.org/10.48550/arXiv.2108.04938
- 25. Zhu L, Mou W, Lai Y, Chen J, Lin S, Xu L, Lin J, Guo Z, Yang T, Lin A (2024) Step into the era of large multimodal models: A pilot study on ChatGPT-4V(ision)'s ability to interpret radiological images. Int J Surg. https://doi.org/10.1097/JS9.0000000000001359
- 26. Noda M, Ueno T, Koshu R, Takaso Y, Shimada MD, Saito C, Sugimoto H, Fushiki H, Ito M, Nomura A (2024) Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study. JMIR Med Educ 10:e57054. https://doi.org/10.2196/57054
- 27. Akrout M, Cirone KD, Vender R (2024) Evaluation of Vision LLMs GPT-4V and LLaVA for the Recognition of Features Characteristic of Melanoma. J Cutan Med Surg 28:98-99. https://doi.org/10.1177/12034754231220934