Research Article

Comparison of the accuracy performance of the Gemini Advanced, GPT-4, Copilot, and GPT-3.5 models in medical imaging systems: A zero-shot prompting analysis

Year 2024, 1216 - 1223, 15.10.2024
https://doi.org/10.28948/ngumuh.1492129

Abstract

Large language models (LLMs) have gained popularity in healthcare and attracted the interest of researchers across various medical specialties. For accurate results, it is important to determine which model performs well under which conditions. This study aims to compare the accuracy of recently developed large language models on medical imaging systems and to evaluate their agreement with one another in terms of the correct answers they give. For this evaluation, a total of 400 questions were divided into four categories: X-ray, ultrasound, magnetic resonance imaging, and nuclear medicine imaging. The models' responses were evaluated with a zero-shot prompting approach by measuring the percentage of correct answers. The McNemar test was used to assess the significance of differences between models, and the Cohen kappa statistic was used to determine the reliability of the models. Accuracy rates of 86.25%, 84.25%, 77.5%, and 59.75% were obtained for Gemini Advanced, GPT-4, Copilot, and GPT-3.5, respectively. Compared with the other models, a strong correlation was found between Gemini Advanced and GPT-4 (κ = 0.762). This study is the first to analyze the accuracy of the responses given by the recently developed Gemini Advanced, GPT-4, Copilot, and GPT-3.5 to questions about medical imaging systems. In addition, a comprehensive dataset consisting of three question types was compiled from various sources on medical imaging systems.

Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis

Abstract

Large Language Models (LLMs) have gained popularity across healthcare and attracted the attention of researchers in various medical specialties. Determining which model performs well in which circumstances is essential for accurate results. This study aims to compare the accuracy of recently developed LLMs for medical imaging systems and to evaluate the reliability of LLMs in terms of correct responses. A total of 400 questions were divided into four categories: X-ray, ultrasound, magnetic resonance imaging, and nuclear medicine. The LLMs' responses were evaluated with a zero-shot prompting approach by measuring the percentage of correct answers. McNemar tests were used to evaluate the significance of differences between models, and Cohen kappa statistics were used to determine the reliability of the models. Gemini Advanced, GPT-4, Copilot, and GPT-3.5 achieved accuracy rates of 86.25%, 84.25%, 77.5%, and 59.75%, respectively. There was a strong correlation between Gemini Advanced and GPT-4 compared with the other models (κ = 0.762). This study is the first to analyze the accuracy of the responses of the recently developed LLMs Gemini Advanced, GPT-4, Copilot, and GPT-3.5 to questions related to medical imaging systems. In addition, a comprehensive dataset with three question types, drawn evenly from various sources, was created for medical imaging systems.
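
The three measures named in the abstract (percentage of correct answers, the McNemar test, and the Cohen kappa statistic) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' procedure: the function names are hypothetical, the McNemar variant shown is the continuity-corrected chi-square form, and the paired per-question results are illustrative counts that are not reported in the abstract. They were chosen only so that the two accuracies equal the reported 86.25% and 84.25% over 400 questions.

```python
import math

def accuracy(correct):
    """Fraction of questions answered correctly."""
    return sum(correct) / len(correct)

def paired_counts(a, b):
    """2x2 table for two models scored on the same questions:
    (both correct, only a correct, only b correct, both wrong)."""
    both = sum(x and y for x, y in zip(a, b))
    only_a = sum(x and not y for x, y in zip(a, b))
    only_b = sum(y and not x for x, y in zip(a, b))
    neither = len(a) - both - only_a - only_b
    return both, only_a, only_b, neither

def mcnemar(a, b):
    """Continuity-corrected McNemar chi-square and p-value (1 degree of freedom)."""
    _, only_a, only_b, _ = paired_counts(a, b)
    discordant = only_a + only_b
    if discordant == 0:
        return 0.0, 1.0  # the models never disagree
    chi2 = (abs(only_a - only_b) - 1) ** 2 / discordant
    # Survival function of chi-square with 1 d.o.f.: erfc(sqrt(x / 2))
    return chi2, math.erfc(math.sqrt(chi2 / 2))

def cohen_kappa(a, b):
    """Cohen's kappa on the correct/incorrect labels of two models."""
    both, only_a, only_b, neither = paired_counts(a, b)
    n = len(a)
    p_o = (both + neither) / n                    # observed agreement
    p_a, p_b = (both + only_a) / n, (both + only_b) / n
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)       # agreement expected by chance
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Illustrative (assumed) paired outcomes for two models on 400 questions:
# 329 both correct, 16 only model A, 8 only model B, 47 both wrong.
model_a = [True] * 329 + [True] * 16 + [False] * 8 + [False] * 47
model_b = [True] * 329 + [False] * 16 + [True] * 8 + [False] * 47

chi2, p_value = mcnemar(model_a, model_b)
kappa = cohen_kappa(model_a, model_b)
print(f"accuracy A = {accuracy(model_a):.4f}, accuracy B = {accuracy(model_b):.4f}")
print(f"McNemar chi2 = {chi2:.3f}, p = {p_value:.3f}, kappa = {kappa:.3f}")
```

The McNemar test looks only at the questions on which the two models disagree, whereas kappa compares overall agreement with the agreement expected by chance, so a pair of models can agree closely while still differing little in accuracy.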

Details

Primary Language: English
Subjects: Natural Language Processing, Planning and Decision Making, Biomedical Sciences and Technologies
Section: Research Articles
Authors

Alpaslan Koç 0000-0002-2000-7379

Ayşe Betül Öztiryaki 0009-0004-9973-3251

Early View Date: September 11, 2024
Publication Date: October 15, 2024
Submission Date: May 29, 2024
Acceptance Date: July 30, 2024
Published in Issue: Year 2024

Cite

APA Koç, A., & Öztiryaki, A. B. (2024). Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi, 13(4), 1216-1223. https://doi.org/10.28948/ngumuh.1492129
AMA Koç A, Öztiryaki AB. Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis. NÖHÜ Müh. Bilim. Derg. October 2024;13(4):1216-1223. doi:10.28948/ngumuh.1492129
Chicago Koç, Alpaslan, and Ayşe Betül Öztiryaki. “Comparison of the Accuracy Performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 Models in Medical Imaging Systems: A Zero-Shot Prompting Analysis”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 13, no. 4 (October 2024): 1216-23. https://doi.org/10.28948/ngumuh.1492129.
EndNote Koç A, Öztiryaki AB (01 October 2024) Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 13 4 1216–1223.
IEEE A. Koç and A. B. Öztiryaki, “Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis”, NÖHÜ Müh. Bilim. Derg., vol. 13, no. 4, pp. 1216–1223, 2024, doi: 10.28948/ngumuh.1492129.
ISNAD Koç, Alpaslan - Öztiryaki, Ayşe Betül. “Comparison of the Accuracy Performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 Models in Medical Imaging Systems: A Zero-Shot Prompting Analysis”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 13/4 (October 2024), 1216-1223. https://doi.org/10.28948/ngumuh.1492129.
JAMA Koç A, Öztiryaki AB. Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis. NÖHÜ Müh. Bilim. Derg. 2024;13:1216–1223.
MLA Koç, Alpaslan and Ayşe Betül Öztiryaki. “Comparison of the Accuracy Performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 Models in Medical Imaging Systems: A Zero-Shot Prompting Analysis”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi, vol. 13, no. 4, 2024, pp. 1216-23, doi:10.28948/ngumuh.1492129.
Vancouver Koç A, Öztiryaki AB. Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis. NÖHÜ Müh. Bilim. Derg. 2024;13(4):1216-23.
