Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis

Alpaslan Koç; Ayşe Betül Öztiryaki

doi:10.28948/ngumuh.1492129

Research Article

Turkish title: Tıbbi görüntüleme sistemlerinde Gemini Advanced, GPT-4, Copilot ve GPT-3.5 modellerinin doğruluk performanslarının karşılaştırılması: Sıfır atışlı yönlendirme analizi

Year 2024, Volume: 13, Issue: 4, pp. 1216-1223, 15.10.2024

Cited By: 1

https://izlik.org/JA75HK28GU

Abstract

Large language models (LLMs) have gained popularity across healthcare and attracted the attention of researchers in various medical specialties. Determining which model performs well under which circumstances is essential for accurate results. This study aims to compare the accuracy of recently developed LLMs on questions about medical imaging systems and to evaluate the agreement among the models in terms of correct responses. A total of 400 questions were divided into four categories: X-ray, ultrasound, magnetic resonance imaging, and nuclear medicine. The LLMs' responses were evaluated with a zero-shot prompting approach by measuring the percentage of correct answers. McNemar tests were used to evaluate the significance of differences between models, and Cohen's kappa statistics were used to determine the reliability of the models. Gemini Advanced, GPT-4, Copilot, and GPT-3.5 achieved accuracy rates of 86.25%, 84.25%, 77.5%, and 59.75%, respectively. There was strong agreement between Gemini Advanced and GPT-4 compared with the other models (κ = 0.762). This study is the first to analyze the accuracy of the responses of the recently developed LLMs Gemini Advanced, GPT-4, Copilot, and GPT-3.5 to questions related to medical imaging systems. In addition, a comprehensive dataset of medical imaging questions covering three question types was created, drawn evenly from various sources.
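The three quantities named in the abstract (percentage of correct answers, McNemar's test on paired outcomes, and Cohen's kappa between two models) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The two 10-question answer vectors are invented toy data, the function names are hypothetical, and the exact (binomial) form of McNemar's test is an assumption, since the abstract does not say which variant was used. The conversion of the reported percentages to counts assumes each rate was computed over all 400 questions.

```python
from math import comb


def accuracy(correct):
    """Fraction of questions answered correctly (list of 0/1 flags)."""
    return sum(correct) / len(correct)


def mcnemar_exact(a, b):
    """Two-sided exact McNemar test on paired 0/1 outcomes.

    Only discordant pairs carry information: under the null hypothesis a
    discordant pair is equally likely to favour either model, so the smaller
    discordant count follows a Binomial(n, 0.5) distribution.
    """
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # a right, b wrong
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # a wrong, b right
    n = n10 + n01
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(n10, n01)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cohen_kappa(a, b):
    """Cohen's kappa for two binary (0/1) label sequences of equal length."""
    n = len(a)
    p_observed = sum(1 for x, y in zip(a, b) if x == y) / n
    pa, pb = sum(a) / n, sum(b) / n
    # Chance agreement: both say 1, or both say 0, if they were independent.
    p_expected = pa * pb + (1 - pa) * (1 - pb)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when both sequences are constant")
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data (hypothetical): 1 = correct answer, 0 = incorrect answer.
model_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
model_b = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0]

print("accuracy A:", accuracy(model_a))
print("accuracy B:", accuracy(model_b))
print("McNemar exact p-value:", mcnemar_exact(model_a, model_b))
print("Cohen's kappa:", cohen_kappa(model_a, model_b))

# Accuracy rates reported in the abstract, converted to counts of correct
# answers under the assumption that each rate is taken over all 400 questions.
TOTAL_QUESTIONS = 400
reported_percent = {
    "Gemini Advanced": 86.25,
    "GPT-4": 84.25,
    "Copilot": 77.5,
    "GPT-3.5": 59.75,
}
counts = {m: round(p * TOTAL_QUESTIONS / 100) for m, p in reported_percent.items()}
print(counts)
```

The sketch uses only the standard library so that each formula is visible. The same statistics are available in established statistical packages.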

Keywords

Large language models, Medical imaging systems, Generative AI, Comparison of the accuracy, Foundation models

There are 33 references in total.

Details

Primary Language	English
Subjects	Natural Language Processing, Planning and Decision Making, Biomedical Sciences and Technologies
Section	Research Article
Authors	Alpaslan Koç (ORCID 0000-0002-2000-7379), Ayşe Betül Öztiryaki (ORCID 0009-0004-9973-3251)
Submission Date	29 May 2024
Acceptance Date	30 July 2024
Early Publication Date	11 September 2024
Publication Date	15 October 2024
DOI	https://doi.org/10.28948/ngumuh.1492129
IZ	https://izlik.org/JA75HK28GU
Published in Issue	Year 2024, Volume: 13, Issue: 4

Cite

APA	Koç, A., & Öztiryaki, A. B. (2024). Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi, 13(4), 1216-1223. https://doi.org/10.28948/ngumuh.1492129
AMA	1. Koç A, Öztiryaki AB. Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis. NÖHÜ Müh. Bilim. Derg. 2024;13(4):1216-1223. doi:10.28948/ngumuh.1492129
Chicago	Koç, Alpaslan, and Ayşe Betül Öztiryaki. 2024. “Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 13 (4): 1216-23. https://doi.org/10.28948/ngumuh.1492129.
EndNote	Koç A, Öztiryaki AB (01 October 2024) Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 13 4 1216–1223.
IEEE	[1] A. Koç and A. B. Öztiryaki, “Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis”, NÖHÜ Müh. Bilim. Derg., vol. 13, no. 4, pp. 1216–1223, Oct. 2024, doi: 10.28948/ngumuh.1492129.
ISNAD	Koç, Alpaslan - Öztiryaki, Ayşe Betül. “Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 13/4 (01 October 2024): 1216-1223. https://doi.org/10.28948/ngumuh.1492129.
JAMA	1. Koç A, Öztiryaki AB. Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis. NÖHÜ Müh. Bilim. Derg. 2024;13:1216–1223.
MLA	Koç, Alpaslan, and Ayşe Betül Öztiryaki. “Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi, vol. 13, no. 4, Oct. 2024, pp. 1216-23, doi:10.28948/ngumuh.1492129.
Vancouver	1. Koç A, Öztiryaki AB. Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis. NÖHÜ Müh. Bilim. Derg. [Internet]. 01 October 2024;13(4):1216-23. Available from: https://izlik.org/JA75HK28GU

Cited By

Performance of Generative Artificial Intelligence Models (GPT-4o, Gemini, Copilot) in YKS/TYT Exam: A Comparative Study

Bilişim Teknolojileri Dergisi

https://doi.org/10.17671/gazibtd.1575755
