Research Article

Benchmarking Artificial Intelligence Models for Clinical Guidance in Nocturia and Nocturnal Polyuria: A Comparative Evaluation of ChatGPT, Gemini, Copilot, and Perplexity

Year 2025, Volume: 20, Issue: 3, 183-192, 20.10.2025
https://doi.org/10.33719/nju1730282

Abstract

Objective: This study aimed to evaluate and compare the performance of four artificial intelligence (AI) models—ChatGPT-4.0, Gemini 1.5 Pro, Copilot, and Perplexity Pro—in answering clinical questions about nocturia and nocturnal polyuria.
Material and Methods: A total of 25 standardized clinical questions were developed across five thematic domains: general understanding, etiology and pathophysiology, diagnostic work-up, management strategies, and special populations. Responses from each AI model were scored by two blinded expert urologists on a five-point Likert scale across five quality domains: relevance, clarity, structure, utility, and factual accuracy. Mean scores were compared using repeated-measures ANOVA or the Friedman test, depending on data distribution. Inter-rater reliability was assessed with the intraclass correlation coefficient (ICC).
Results: ChatGPT-4.0 and Perplexity Pro achieved the highest overall mean scores (4.61/5 and 4.52/5), significantly outperforming Gemini (4.35/5) and Copilot (3.63/5) (p = 0.032). ChatGPT scored highest in “general understanding” (4.86/5, p = 0.018), while Perplexity led in “management strategies” (4.74/5, p = 0.021). Copilot consistently scored lowest, particularly in “diagnostic work-up” (3.42/5, p = 0.008). In quality domain analysis, ChatGPT and Perplexity again outperformed others, especially in “factual accuracy” (4.48/5 and 4.44/5), with Copilot trailing (3.54/5, p = 0.001). Inter-rater reliability was excellent (ICC = 0.91).
Conclusion: ChatGPT and Perplexity Pro demonstrated strong performance in delivering clinically relevant and accurate information on nocturia and nocturnal polyuria. These findings suggest their potential as supportive tools for education and decision-making. Copilot’s lower performance underscores the need for continued model refinement. AI integration in clinical contexts should remain guided by expert validation and alignment with current urological guidelines.
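
For readers who want to see how such a comparison can be run in practice, the following is a minimal sketch (in Python) of the statistical workflow the Methods describe: per-question Likert scores for the four models compared with a non-parametric Friedman test, plus a two-way random-effects ICC for two raters. It is not the authors' analysis code; the score distributions, the random seed, and the icc2_1 helper are illustrative assumptions, not study data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
models = ["ChatGPT-4.0", "Gemini 1.5 Pro", "Copilot", "Perplexity Pro"]

# Illustrative per-question scores (25 questions x 4 models) on a 1-5 scale,
# loosely centered on the overall means reported in the Results.
scores = np.clip(rng.normal([4.6, 4.35, 3.6, 4.5], 0.3, size=(25, 4)), 1, 5)
for name, mean in zip(models, scores.mean(axis=0)):
    print(f"{name}: {mean:.2f}/5")

# Likert ratings are ordinal and often non-normal, so the non-parametric
# Friedman test compares the four related samples (same questions throughout).
stat, p = stats.friedmanchisquare(*scores.T)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater,
    from the mean squares of an items-x-raters matrix (Shrout-Fleiss)."""
    n, k = ratings.shape
    grand = ratings.mean()
    ms_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    resid = (ratings
             - ratings.mean(axis=1, keepdims=True)
             - ratings.mean(axis=0, keepdims=True)
             + grand)
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Two hypothetical raters scoring the same 25 questions for one model.
rater_a = scores[:, 0]
rater_b = np.clip(rater_a + rng.normal(0, 0.15, size=25), 1, 5)
print(f"ICC(2,1) = {icc2_1(np.column_stack([rater_a, rater_b])):.2f}")

The two-way random-effects form is the usual choice when raters are treated as interchangeable samples from a larger pool of potential experts; values of 0.90 or above are conventionally read as excellent agreement, which is how the study's ICC of 0.91 is interpreted.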

Ethical Statement

This study did not involve human participants, animal subjects, or patient data. Therefore, ethical approval was not required in accordance with institutional and national research committee standards. All AI models were accessed through publicly available platforms under their respective terms of use.

Supporting Institution

No institutional, commercial, or personal funding was received for this research.



Details

Primary Language English
Subjects Urology
Journal Section Research Article
Authors

Gökhan Çeker (ORCID: 0000-0002-7891-9450)

İsmail Ulus (ORCID: 0000-0002-2005-9734)

İbrahim Hacıbey (ORCID: 0000-0002-2212-5504)

Publication Date October 20, 2025
Submission Date June 30, 2025
Acceptance Date August 11, 2025
Published in Issue Year 2025 Volume: 20 Issue: 3

Cite

Vancouver Çeker G, Ulus İ, Hacıbey İ. Benchmarking Artificial Intelligence Models for Clinical Guidance in Nocturia and Nocturnal Polyuria: A Comparative Evaluation of ChatGPT, Gemini, Copilot, and Perplexity. New J Urol. 2025;20(3):183-92.