Comparative analysis of large language models' performance in breast imaging
Year: 2024, Volume: 15, Issue: 4, Pages: 542-546, Published: 31.12.2024
Aim: To evaluate the performance of the flagship models, OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet, in breast
imaging cases.
Material and Methods: The dataset consisted of cases from the publicly available Case of the Month archive by the Society of Breast Imaging. Questions were classified as text-based or containing images from mammography, ultrasound, magnetic resonance imaging, or hybrid imaging. The accuracy rates of GPT-4o and Claude 3.5 Sonnet were compared using the Mann-Whitney U test.
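The statistical comparison described above can be sketched in a few lines. This is a minimal illustration only: the abstract does not publish the per-question data, so the 0/1 correctness vectors below are hypothetical placeholders, and the test call simply mirrors the Mann-Whitney U comparison named in the Methods.

```python
# Hedged sketch of the accuracy comparison; the score vectors are
# hypothetical (the study's per-question data are not published here).
from scipy.stats import mannwhitneyu

# Hypothetical per-question correctness (1 = correct, 0 = incorrect)
gpt4o_scores = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
claude_scores = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]

# Two-sided Mann-Whitney U test, as in the Methods
stat, p_value = mannwhitneyu(gpt4o_scores, claude_scores,
                             alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")
```

On binary correct/incorrect data, the Mann-Whitney U test reduces to comparing the two models' proportions of correct answers, which is why a non-significant p-value here corresponds to the "comparable performance" conclusion.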
Results: Of the 94 questions in total, 61.7% were image-based. The overall accuracy rate of GPT-4o was higher than that of Claude 3.5 Sonnet (75.4% vs. 67.7%; p=0.432). GPT-4o achieved higher scores on ultrasound- and hybrid imaging-based questions, while Claude 3.5 Sonnet performed better on mammography-based questions. Both models reached higher accuracy rates in tumor group cases than in the non-tumor group (both p>0.05). The models' overall performance in breast imaging cases exceeded 75%, ranging from 64% to 83% across questions involving different imaging modalities.
Conclusion: In breast imaging cases, although GPT-4o generally achieved higher accuracy rates than Claude 3.5 Sonnet on both image-based and other question types, the two models' performances were comparable.
Kim S, Lee CK, Kim SS. Large Language Models: A Guide for Radiologists. Korean J Radiol. 2024;25(2):126-133. doi:10.3348/kjr.2023.0997
https://openai.com/index/hello-gpt-4o/. Accessed July 28, 2024.
https://www.anthropic.com/news/claude-3-5-sonnet. Accessed July 28, 2024.
Sonoda Y, Kurokawa R, Nakamura Y, et al. Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in "Diagnosis Please" cases. Jpn J Radiol. Published online July 1, 2024. doi:10.1007/s11604-024-01619-y
Oura T, Tatekawa H, Horiuchi D, et al. Diagnostic accuracy of vision-language models on Japanese diagnostic radiology, nuclear medicine, and interventional radiology specialty board examinations. Jpn J Radiol. Published online July 20, 2024. doi:10.1007/s11604-024-01633-0
Sorin V, Glicksberg BS, Artsi Y, et al. Utilizing large language models in breast cancer management: systematic review. J Cancer Res Clin Oncol. 2024;150(3):140. Published 2024 Mar 19. doi:10.1007/s00432-024-05678-6
Cozzi A, Pinker K, Hidber A, et al. BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study. Radiology. 2024;311(1):e232133. doi:10.1148/radiol.232133
Choi HS, Song JY, Shin KH, Chang JH, Jang BS. Developing prompts from large language model for extracting clinical information from pathology and ultrasound reports in breast cancer. Radiat Oncol J. 2023;41(3):209-216. doi:10.3857/roj.2023.00633
Almeida LC, Farina EMJM, Kuriki PEA, Abdala N, Kitamura FC. Performance of ChatGPT on the Brazilian Radiology and Diagnostic Imaging and Mammography Board Examinations. Radiol Artif Intell. 2024;6(1):e230103. doi:10.1148/ryai.230103
Haver HL, Bahl M, Doo FX, et al. Evaluation of Multimodal ChatGPT (GPT-4V) in Describing Mammography Image Features. Can Assoc Radiol J. Published online April 6, 2024. doi:10.1177/08465371241247043
Hirano Y, Hanaoka S, Nakao T, et al. GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination. Jpn J Radiol. 2024;42(8):918-926. doi:10.1007/s11604-024-01561-z
Payne DL, Purohit K, Borrero WM, et al. Performance of GPT-4 on the American College of Radiology In-training Examination: Evaluating Accuracy, Model Drift, and Fine-tuning. Acad Radiol. 2024;31(7):3046-3054. doi:10.1016/j.acra.2024.04.006
Horiuchi D, Tatekawa H, Oura T, et al. ChatGPT's diagnostic performance based on textual vs. visual information compared to radiologists' diagnostic performance in musculoskeletal radiology. Eur Radiol. Published online July 12, 2024. doi:10.1007/s00330-024-10902-5
Sood A, Mansoor N, Memmi C, Lynch M, Lynch J. Generative pretrained transformer-4, an artificial intelligence text predictive model, has a high capability for passing novel written radiology exam questions. Int J Comput Assist Radiol Surg. 2024;19(4):645-653. doi:10.1007/s11548-024-03071-9
APA
Beşler, M. S. (2024). Comparative analysis of large language models' performance in breast imaging. Turkish Journal of Clinics and Laboratory, 15(4), 542-546. https://doi.org/10.18663/tjcl.1561361
AMA
Beşler MS. Comparative analysis of large language models' performance in breast imaging. TJCL. December 2024;15(4):542-546. doi:10.18663/tjcl.1561361
Chicago
Beşler, Muhammed Said. “Comparative Analysis of Large Language Models' Performance in Breast Imaging”. Turkish Journal of Clinics and Laboratory 15, no. 4 (December 2024): 542-46. https://doi.org/10.18663/tjcl.1561361.
EndNote
Beşler MS (December 1, 2024) Comparative analysis of large language models' performance in breast imaging. Turkish Journal of Clinics and Laboratory 15 4 542-546.
IEEE
M. S. Beşler, “Comparative analysis of large language models' performance in breast imaging”, TJCL, vol. 15, no. 4, pp. 542-546, 2024, doi: 10.18663/tjcl.1561361.
ISNAD
Beşler, Muhammed Said. “Comparative Analysis of Large Language Models' Performance in Breast Imaging”. Turkish Journal of Clinics and Laboratory 15/4 (December 2024), 542-546. https://doi.org/10.18663/tjcl.1561361.
JAMA
Beşler MS. Comparative analysis of large language models' performance in breast imaging. TJCL. 2024;15(4):542-546.
MLA
Beşler, Muhammed Said. “Comparative Analysis of Large Language Models' Performance in Breast Imaging”. Turkish Journal of Clinics and Laboratory, vol. 15, no. 4, 2024, pp. 542-6, doi:10.18663/tjcl.1561361.
Vancouver
Beşler MS. Comparative analysis of large language models' performance in breast imaging. TJCL. 2024;15(4):542-6.