Research Article

A Comparative Study on the Question-Answering Proficiency of Artificial Intelligence Models in Bladder-Related Conditions: An Evaluation of Gemini and ChatGPT-4o

Year 2025, Volume 7, Issue 1, 201-205, 15.01.2025
https://doi.org/10.37990/medr.1601528

Abstract

Aim: The rapid evolution of artificial intelligence (AI) has transformed medicine, with tools like ChatGPT and Google Gemini increasingly supporting clinical decision-making. ChatGPT's advancements, particularly with GPT-4, show promise in diagnostics and education. However, variability in accuracy and limitations in complex scenarios underscore the need for further evaluation of these models in medical applications. This study aimed to assess the accuracy of ChatGPT-4o and Gemini AI, and the agreement between them, in identifying bladder-related conditions, including neurogenic bladder, vesicoureteral reflux (VUR), and posterior urethral valve (PUV).
Material and Method: This study, conducted in October 2024, compared the accuracy of ChatGPT-4o and Gemini AI on 51 questions covering neurogenic bladder, VUR, and PUV. Questions were randomly selected from pediatric surgery and urology materials, and the models' responses were evaluated using accuracy metrics and statistical analysis to compare performance and quantify agreement.
Results: ChatGPT-4o and Gemini AI demonstrated similar accuracy across neurogenic bladder, VUR, and PUV questions, with true response rates of 66.7% and 68.6%, respectively, and no statistically significant difference between them (p>0.05). Combined accuracy across all topics was 67.6%. Agreement between the two models was strong (κ=0.87).
Conclusion: This study highlights the comparable accuracy of ChatGPT-4o and Gemini AI across key bladder-related conditions, with no significant differences in performance.
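
For the statistically inclined reader, the sketch below illustrates how the abstract's headline figures can be computed: per-model accuracy, the chi-square comparison behind p>0.05, and Cohen's κ for inter-model agreement. It is a minimal Python example in which randomly generated correct/incorrect gradings stand in for the study's actual data; the variable names and generated values are illustrative assumptions, not the authors' dataset or code.

    import numpy as np
    from scipy.stats import chi2_contingency
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical gradings for the 51 questions (1 = correct, 0 = incorrect);
    # the study's real gradings are not reproduced here.
    rng = np.random.default_rng(42)
    chatgpt_4o = rng.integers(0, 2, size=51)
    gemini = rng.integers(0, 2, size=51)

    # Per-model accuracy (the study reports 66.7% and 68.6%).
    print(f"ChatGPT-4o accuracy: {chatgpt_4o.mean():.1%}")
    print(f"Gemini accuracy:     {gemini.mean():.1%}")

    # Cohen's kappa quantifies agreement between the two models'
    # correct/incorrect patterns (the study reports kappa = 0.87).
    print(f"Cohen's kappa:       {cohen_kappa_score(chatgpt_4o, gemini):.2f}")

    # Chi-square test on the 2x2 correct/incorrect contingency table
    # (the study reports p > 0.05, i.e., no significant difference).
    table = [[chatgpt_4o.sum(), 51 - chatgpt_4o.sum()],
             [gemini.sum(), 51 - gemini.sum()]]
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"Chi-square p-value:  {p:.3f}")

By the conventional Landis and Koch benchmarks, κ above 0.80 indicates almost-perfect agreement, so κ=0.87 means the two models tended to succeed and fail on the same questions.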

References

  • Demir S. Evaluation of responses to questions about keratoconus using ChatGPT-4.0, Google Gemini and Microsoft Copilot: a comparative study of large language models on Keratoconus. Eye Contact Lens. 2024 Dec 4. doi: 10.1097/ICL.0000000000001158. [Epub ahead of print].
  • Sun SH, Chen K, Anavim S, et al. Large language models with vision on diagnostic radiology board exam style questions. Acad Radiol. 2024 Dec 3. doi: 10.1016/j.acra.2024.11.028. [Epub ahead of print].
  • Galvis-García E, Vega-González FJ, Emura F, et al. Inteligencia artificial en la colonoscopia de tamizaje y la disminución del error [Artificial intelligence in screening colonoscopy and error reduction]. Cir Cir. 2023;91:411-21.
  • De Busser B, Roth L, De Loof H. The role of large language models in self-care: a study and benchmark on medicines and supplement guidance accuracy. Int J Clin Pharm. 2024 Dec 7. doi: 10.1007/s11096-024-01839-2. [Epub ahead of print].
  • Ardila CM, Yadalam PK. ChatGPT's influence on dental education: methodological challenges and ethical considerations. Int Dent J. 2024 Dec 6. doi: 10.1016/j.identj.2024.11.014. [Epub ahead of print].
  • Meo AS, Shaikh N, Meo SA. Assessing the accuracy and efficiency of Chat GPT-4 Omni (GPT-4o) in biomedical statistics: Comparative study with traditional tools. Saudi Med J. 2024;45:1383-90.
  • Chen Y, Huang X, Yang F, et al. Performance of ChatGPT and Bard on the medical licensing examinations varies across different cultures: a comparison study. BMC Med Educ. 2024;24:1372.
  • Bilgin IA, Percem AK, Aslan O. Artificial intelligence and robotic surgery in colorectal cancer surgery. J Clin Trials Exp Investig. 2024;3:83-4.
  • Yılmaz M. Revolutionizing laboratory medicine: the critical role of artificial intelligence and deep learning: Artificial intelligence and medical laboratory. The Injector. 2024;3:39-40.
  • Maraqa N, Samargandi R, Poichotte A, et al. Comparing performances of French orthopaedic surgery residents with the artificial intelligence ChatGPT-4/4o in the French diploma exams of orthopaedic and trauma surgery. Orthop Traumatol Surg Res. 2024 Dec 4. doi: 10.1016/j.otsr.2024.104080. [Epub ahead of print].
  • Giorgino R, Alessandri-Bonetti M, Luca A, et al. ChatGPT in orthopedics: a narrative review exploring the potential of artificial intelligence in orthopedic practice. Front Surg. 2023;10:1284015.
  • D'Agostino M, Feo F, Martora F, et al. ChatGPT and dermatology. Ital J Dermatol Venerol. 2024;159:566-71.
  • Chen TC, Multala E, Kearns P, et al. Assessment of ChatGPT's performance on neurology written board examination questions. BMJ Neurol Open. 2023;5:e000530.
  • Karakas C, Brock D, Lakhotia A. Leveraging ChatGPT in the pediatric neurology clinic: practical considerations for use to improve efficiency and outcomes. Pediatr Neurol. 2023;148:157-63.
  • OpenAI, Achiam J, Adler S, et al. GPT-4 technical report. arXiv. 2023 Mar 15. doi: 10.48550/arXiv.2303.08774. [Preprint posted online].
  • Jin HK, Kim E. Performance of GPT-3.5 and GPT-4 on the Korean pharmacist licensing examination: comparison study. JMIR Med Educ. 2024;10:e57451.
  • Ulus SA. How does ChatGPT perform on the European board of orthopedics and traumatology examination? A comparative study. Academic Journal of Health Sciences. 2023;38:43-6.
  • Greif C, Mpunga N, Koopman IV, et al. Evaluating the effectiveness of ChatGPT4 in the diagnosis and workup of dermatologic conditions. Dermatol Online J. 2024;30. doi: 10.5070/D330464104.
  • Azizoglu M, Aydogdu B. How does ChatGPT perform on the European Board of Pediatric Surgery examination? A randomized comparative study. Academic Journal of Health Sciences. 2024;39:23-6.
  • Robinson EJ, Qiu C, Sands S, et al. Physician vs. AI-generated messages in urology: evaluation of accuracy, completeness, and preference by patients and physicians. World J Urol. 2024;43:48.
  • Zong H, Wu R, Cha J, et al. Large Language Models in worldwide medical exams: platform development and comprehensive analysis. J Med Internet Res. 2024;26:e66114.

Details

Primary Language English
Subjects Pediatric Urology
Journal Section Original Articles
Authors

Mustafa Azizoğlu (ORCID: 0009-0000-3563-1230)

Sergey Klyuev (ORCID: 0000-0002-3217-6874)

Publication Date January 15, 2025
Submission Date December 14, 2024
Acceptance Date January 10, 2025
Published in Issue Year 2025

Cite

AMA Azizoğlu M, Klyuev S. A Comparative Study on the Question-Answering Proficiency of Artificial Intelligence Models in Bladder-Related Conditions: An Evaluation of Gemini and ChatGPT-4o. Med Records. January 2025;7(1):201-205. doi:10.37990/medr.1601528
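
For reference managers, the same citation can be rendered as a BibTeX entry; the entry key is an arbitrary choice, and the journal name is expanded from the "Med Records" abbreviation.

    @article{azizoglu2025comparative,
      author  = {Azizoğlu, Mustafa and Klyuev, Sergey},
      title   = {A Comparative Study on the Question-Answering Proficiency of Artificial Intelligence Models in Bladder-Related Conditions: An Evaluation of Gemini and ChatGPT-4o},
      journal = {Medical Records},
      year    = {2025},
      volume  = {7},
      number  = {1},
      pages   = {201--205},
      doi     = {10.37990/medr.1601528}
    }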
