Research Article

Benchmarking Artificial Intelligence Models for Clinical Guidance in Nocturia and Nocturnal Polyuria: A Comparative Evaluation of ChatGPT, Gemini, Copilot, and Perplexity

Year 2025, Volume: 20, Issue: 3, 183-192, 20.10.2025
https://doi.org/10.33719/nju1730282

Abstract

Objective: This study aimed to evaluate and compare the performance of four artificial intelligence (AI) models—ChatGPT-4.0, Gemini 1.5 Pro, Copilot, and Perplexity Pro—in answering clinical questions about nocturia and nocturnal polyuria.
Material and Methods: A total of 25 standardized clinical questions were developed across five thematic domains: general understanding, etiology and pathophysiology, diagnostic work-up, management strategies, and special populations. Responses from each AI model were scored by two blinded expert urologists on a five-point Likert scale across five quality domains: relevance, clarity, structure, utility, and factual accuracy. Mean scores were compared using repeated-measures ANOVA or the Friedman test, depending on data distribution. Inter-rater reliability was assessed with the intraclass correlation coefficient (ICC).
Results: ChatGPT-4.0 and Perplexity Pro achieved the highest overall mean scores (4.61/5 and 4.52/5), significantly outperforming Gemini (4.35/5) and Copilot (3.63/5) (p = 0.032). ChatGPT scored highest in “general understanding” (4.86/5, p = 0.018), while Perplexity led in “management strategies” (4.74/5, p = 0.021). Copilot consistently scored lowest, particularly in “diagnostic work-up” (3.42/5, p = 0.008). In quality domain analysis, ChatGPT and Perplexity again outperformed others, especially in “factual accuracy” (4.48/5 and 4.44/5), with Copilot trailing (3.54/5, p = 0.001). Inter-rater reliability was excellent (ICC = 0.91).
Conclusion: ChatGPT and Perplexity Pro demonstrated strong performance in delivering clinically relevant and accurate information on nocturia and nocturnal polyuria. These findings suggest their potential as supportive tools for education and decision-making. Copilot’s lower performance underscores the need for continued model refinement. AI integration in clinical contexts should remain guided by expert validation and alignment with current urological guidelines.
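
For readers who want to see how such a comparison can be run in practice, the following is a minimal sketch (in Python) of the statistical workflow the Methods describe: per-question Likert scores for the four models compared with a non-parametric Friedman test, plus a two-way random-effects ICC for two raters. It is not the authors' analysis code; the score distributions, the random seed, and the icc2_1 helper are illustrative assumptions, not study data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
models = ["ChatGPT-4.0", "Gemini 1.5 Pro", "Copilot", "Perplexity Pro"]

# Illustrative per-question scores (25 questions x 4 models) on a 1-5 scale,
# loosely centered on the overall means reported in the Results.
scores = np.clip(rng.normal([4.6, 4.35, 3.6, 4.5], 0.3, size=(25, 4)), 1, 5)
for name, mean in zip(models, scores.mean(axis=0)):
    print(f"{name}: {mean:.2f}/5")

# Likert ratings are ordinal and often non-normal, so the non-parametric
# Friedman test compares the four related samples (same questions throughout).
stat, p = stats.friedmanchisquare(*scores.T)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater,
    from the mean squares of an items-x-raters matrix (Shrout-Fleiss)."""
    n, k = ratings.shape
    grand = ratings.mean()
    ms_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    resid = (ratings
             - ratings.mean(axis=1, keepdims=True)
             - ratings.mean(axis=0, keepdims=True)
             + grand)
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Two hypothetical raters scoring the same 25 questions for one model.
rater_a = scores[:, 0]
rater_b = np.clip(rater_a + rng.normal(0, 0.15, size=25), 1, 5)
print(f"ICC(2,1) = {icc2_1(np.column_stack([rater_a, rater_b])):.2f}")

The two-way random-effects form is the usual choice when raters are treated as interchangeable samples from a larger pool of potential experts; values of 0.90 or above are conventionally read as excellent agreement, which is how the study's ICC of 0.91 is interpreted.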

Ethical Statement

This study did not involve human participants, animal subjects, or patient data. Therefore, ethical approval was not required in accordance with institutional and national research committee standards. All AI models were accessed through publicly available platforms under their respective terms of use.

Supporting Institution

No institutional, commercial, or personal funding was received for this research.



Details

Primary Language English
Subjects Urology
Journal Section Research Article
Authors

Gökhan Çeker (ORCID: 0000-0002-7891-9450)

İsmail Ulus (ORCID: 0000-0002-2005-9734)

İbrahim Hacıbey (ORCID: 0000-0002-2212-5504)

Publication Date October 20, 2025
Submission Date June 30, 2025
Acceptance Date August 11, 2025
Published in Issue Year 2025 Volume: 20 Issue: 3

Cite

Vancouver Çeker G, Ulus İ, Hacıbey İ. Benchmarking Artificial Intelligence Models for Clinical Guidance in Nocturia and Nocturnal Polyuria: A Comparative Evaluation of ChatGPT, Gemini, Copilot, and Perplexity. New J Urol. 2025;20(3):183-92.