Research Article

AI-Assisted Knowledge Assessment: Comparison of ChatGPT and Gemini on Undescended Testicle in Children

Year 2025, Volume: 5 Issue: 3, 93 - 97, 23.09.2025

Abstract

Aim:
This study aimed to evaluate the accuracy and completeness of ChatGPT-4 and Google Gemini in answering questions about undescended testis (UDT), since these AI tools can produce seemingly plausible yet incorrect information, warranting caution in medical applications.
Methods:
The researchers independently created 20 questions and submitted the identical set to both ChatGPT-4 and Google Gemini. A pediatrician and a pediatric surgeon rated each response for accuracy and completeness using the scale of Johnson et al. (accuracy from 1 to 6, completeness from 1 to 3); responses lacking relevant content received a score of 0. Statistical analyses were performed in R (version 4.3.1) to assess differences in accuracy and consistency between the two tools.
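The scoring and summary procedure described above can be illustrated with a short sketch. The per-question scores below are hypothetical, constructed only so that their summaries match the medians and means reported in the Results; they are not the study's data, and the study's actual analysis was performed in R, not Python.

```python
from statistics import mean, median

# Hypothetical illustrative scores (NOT the study's raw data): accuracy on the
# 6-point Johnson et al. scale for 20 responses from each chatbot, chosen so
# that the summaries reproduce the reported medians and means.
chatgpt_acc = [6, 5, 6, 4, 6, 5, 6, 6, 5, 4, 6, 5, 6, 6, 5, 6, 4, 6, 5, 5]
gemini_acc = [6, 5, 6, 6, 4, 6, 5, 6, 6, 5, 6, 6, 4, 6, 5, 6, 6, 5, 6, 5]

def summarize(scores):
    """Return (median, mean), the summary statistics the study reports."""
    return median(scores), mean(scores)

print(summarize(chatgpt_acc))  # median 5.5, mean 5.35
print(summarize(gemini_acc))   # median 6, mean 5.5
```

A paired non-parametric test (e.g. a Wilcoxon signed-rank test on the per-question score pairs) would be one conventional way to compare the two chatbots, though the abstract does not state which test the R analysis used.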
Results:
Both chatbots answered all 20 questions. ChatGPT achieved a median accuracy score of 5.5 (mean 5.35), while Google Gemini achieved a median of 6 (mean 5.5). Completeness was similar: ChatGPT had a median score of 3, and Google Gemini performed comparably.
Conclusion:
ChatGPT and Google Gemini showed comparable accuracy and completeness; however, inconsistencies between accuracy and completeness suggest these AI tools still require refinement. Regular updates are essential to keep AI-generated medical information on UDT reliable, current, and accurate.

References

  • 1. Patil NS, Huang RS, van der Pol CB, Larocque N. Comparative Performance of ChatGPT and Bard in a Text-Based Radiology Knowledge Assessment. Can Assoc Radiol J. 2024;75(2):344-50.
  • 2. Haid B, Rein P, Oswald J. Undescended testes: Diagnostic Algorithm and Treatment. Eur Urol Focus. 2017;3(2-3):155-7.
  • 3. Bradshaw CJ, Corbet-Burcher G, Hitchcock R. Age at orchidopexy in the UK: has new evidence changed practice? J Pediatr Urol. 2014;10(4):758-62.
  • 4. Kolon TF, Herndon CD, Baker LA, Baskin LS, Baxter CG, Cheng EY, et al. Evaluation and treatment of cryptorchidism: AUA guideline. J Urol. 2014;192(2):337-45.
  • 5. Promm M, Dittrich A, Brandstetter S, Fill-Malfertheiner S, Melter M, Seelbach-Göbel B, et al. Evaluation of Undescended Testes in Newborns: It Is Really Simple, Just Not Easy. Urol Int. 2021;105(11-12):1034-8.
  • 6. Holland AJ, Nassar N, Schneuer FJ. Undescended testes: an update. Curr Opin Pediatr. 2016;28(3):388-94.
  • 7. Batra NV, DeMarco RT, Bayne CE. A narrative review of the history and evidence-base for the timing of orchidopexy for cryptorchidism. J Pediatr Urol. 2021;17(2):239-45.
  • 8. Giannakopoulos K, Kavadella A, Aaqel Salim A, Stamatopoulos V, Kaklamanos EG. Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study. J Med Internet Res. 2023;25:e51580.
  • 9. McMullan M. Patients using the Internet to obtain health information: how this affects the patient-health professional relationship. Patient Educ Couns. 2006;63(1-2):24-8.
  • 10. Ozdemir Kacer E, Kacer I. Evaluating the quality and reliability of YouTube videos on scabies in children: A cross-sectional study. PLoS One. 2024;19(10):e0310508.
  • 11. Wong ZSY, Zhou J, Zhang Q. Artificial Intelligence for infectious disease Big Data Analytics. Infect Dis Health. 2019;24(1):44-8.
  • 12. Johnson D, Goodman R, Patrinely J, Stone C, Zimmerman E, Donald R, et al. Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model. Res Sq. 2023.
  • 13. Durmaz Engin C, Karatas E, Ozturk T. Exploring the Role of ChatGPT-4, BingAI, and Gemini as Virtual Consultants to Educate Families about Retinopathy of Prematurity. Children (Basel). 2024;11(6).
  • 14. Pirkle S, Yang J, Blumberg TJ. Do ChatGPT and Gemini Provide Appropriate Recommendations for Pediatric Orthopaedic Conditions? J Pediatr Orthop. 2024.
  • 15. Khromchenko K, Shaikh S, Singh M, Vurture G, Rana RA, Baum JD. ChatGPT-3.5 Versus Google Bard: Which Large Language Model Responds Best to Commonly Asked Pregnancy Questions? Cureus. 2024;16(7):e65543.
  • 16. Mediboina A, Badam RK, Chodavarapu S. Assessing the Accuracy of Information on Medication Abortion: A Comparative Analysis of ChatGPT and Google Bard AI. Cureus. 2024;16(1):e51544.
  • 17. Rahsepar AA, Tavakoli N, Kim GHJ, Hassani C, Abtin F, Bedayat A. How AI Responds to Common Lung Cancer Questions: ChatGPT vs Google Bard. Radiology. 2023;307(5):e230922.
  • 18. Chervonski E, Harish KB, Rockman CB, Sadek M, Teter KA, Jacobowitz GR, et al. Generative artificial intelligence chatbots may provide appropriate informational responses to common vascular surgery questions by patients. Vascular. 2024:17085381241240550.
  • 19. Cung M, Sosa B, Yang HS, McDonald MM, Matthews BG, Vlug AG, et al. The performance of artificial intelligence chatbot large language models to address skeletal biology and bone health queries. J Bone Miner Res. 2024;39(2):106-15.
  • 20. Tong L, Zhang C, Liu R, Yang J, Sun Z. Comparative performance analysis of large language models: ChatGPT-3.5, ChatGPT-4 and Google Gemini in glucocorticoid-induced osteoporosis. J Orthop Surg Res. 2024;19(1):574.

There are 20 citations in total.

Details

Primary Language English
Subjects Clinical Sciences (Other)
Journal Section Research Articles
Authors

Emine Özdemir Kaçer 0000-0002-0111-1672

Mustafa Tuşat 0000-0003-2327-4250

Murat Kılıçaslan 0000-0003-1243-9830

Sebahattin Memiş 0000-0002-3829-9218

Publication Date September 23, 2025
Submission Date July 10, 2025
Acceptance Date August 11, 2025
Published in Issue Year 2025 Volume: 5 Issue: 3
