TY - JOUR
T1 - Benchmarking Artificial Intelligence Models for Clinical Guidance in Nocturia and Nocturnal Polyuria: A Comparative Evaluation of ChatGPT, Gemini, Copilot, and Perplexity
TT - Noktüri ve Noktürnal Polüri İçin Klinik Rehberlikte Yapay Zekâ Modellerinin Karşılaştırmalı Değerlendirmesi: ChatGPT, Gemini, Copilot ve Perplexity Analizi
AU - Çeker, Gökhan
AU - Ulus, İsmail
AU - Hacıbey, İbrahim
PY - 2025
DA - October
Y2 - 2025
DO - 10.33719/nju1730282
JF - The New Journal of Urology
JO - New J Urol
PB - Ali İhsan TAŞÇI
WT - DergiPark
SN - 3023-6940
SP - 183
EP - 192
VL - 20
IS - 3
LA - en
AB - Objective: This study aimed to evaluate and compare the performance of four artificial intelligence (AI) models—ChatGPT-4.0, Gemini 1.5 Pro, Copilot, and Perplexity Pro—in answering clinical questions about nocturia and nocturnal polyuria. Material and Methods: A total of 25 standardized clinical questions were developed across five thematic domains: general understanding, etiology and pathophysiology, diagnostic work-up, management strategies, and special populations. Responses from each AI model were scored by two blinded expert urologists using a five-point Likert scale across five quality domains: relevance, clarity, structure, utility, and factual accuracy. Mean scores were compared using repeated measures ANOVA or Friedman tests, depending on data distribution. Inter-rater reliability was measured via the intraclass correlation coefficient (ICC). Results: ChatGPT-4.0 and Perplexity Pro achieved the highest overall mean scores (4.61/5 and 4.52/5), significantly outperforming Gemini (4.35/5) and Copilot (3.63/5) (p = 0.032). ChatGPT scored highest in “general understanding” (4.86/5, p = 0.018), while Perplexity led in “management strategies” (4.74/5, p = 0.021). Copilot consistently scored lowest, particularly in “diagnostic work-up” (3.42/5, p = 0.008). In the quality-domain analysis, ChatGPT and Perplexity again outperformed the others, especially in “factual accuracy” (4.48/5 and 4.44/5), with Copilot trailing (3.54/5, p = 0.001). Inter-rater reliability was excellent (ICC = 0.91). Conclusion: ChatGPT and Perplexity Pro demonstrated strong performance in delivering clinically relevant and accurate information on nocturia and nocturnal polyuria. These findings suggest their potential as supportive tools for education and decision-making. Copilot’s lower performance underscores the need for continued model refinement. AI integration in clinical contexts should remain guided by expert validation and alignment with current urological guidelines.
KW - artificial intelligence
KW - large language models
KW - nocturia
KW - nocturnal polyuria
N2 - Amaç: Bu çalışma, noktüri ve noktürnal poliüriye ilişkin klinik soruları yanıtlama konusunda dört yapay zekâ (YZ) modelinin—ChatGPT-4.0, Gemini 1.5 Pro, Copilot ve Perplexity Pro—performansını değerlendirmeyi ve karşılaştırmayı amaçladı. Yöntemler: Beş tematik başlık altında (genel bilgi, etiyoloji ve patofizyoloji, tanısal yaklaşım, tedavi stratejileri ve özel popülasyonlar) toplam 25 standartlaştırılmış klinik soru oluşturuldu. Her bir YZ modelinin yanıtları, beş kalite alanı (ilgililik, açıklık, yapı, klinik fayda ve olgusal doğruluk) üzerinden, beşli Likert ölçeği kullanılarak iki kör üroloji uzmanı tarafından puanlandı. Ortalama skorlar, veri dağılımına göre tekrarlayan ölçümler için ANOVA veya Friedman testi ile karşılaştırıldı. Gözlemciler arası uyum, sınıf içi korelasyon katsayısı (ICC) ile değerlendirildi. Bulgular: ChatGPT-4.0 ve Perplexity Pro, sırasıyla 4,61/5 ve 4,52/5 genel ortalama skorlarla en yüksek puanları alarak Gemini (4,35/5) ve Copilot’un (3,63/5) anlamlı şekilde önüne geçti (p = 0,032). ChatGPT “genel bilgi” alanında en yüksek skoru aldı (4,86/5, p = 0,018), Perplexity ise “tedavi stratejileri” başlığında liderdi (4,74/5, p = 0,021). Copilot tüm alanlarda en düşük puanları aldı; özellikle “tanısal yaklaşım” alanında performansı düşüktü (3,42/5, p = 0,008). Kalite alanı analizinde, özellikle “olgusal doğruluk” kriterinde ChatGPT ve Perplexity modelleri sırasıyla 4,48/5 ve 4,44/5 skorlarıyla yine üstünlük gösterdi; Copilot ise geride kaldı (3,54/5, p = 0,001). Gözlemciler arası uyum mükemmel düzeydeydi (ICC = 0,91). Sonuç: ChatGPT ve Perplexity Pro, noktüri ve noktürnal poliüri hakkında klinik açıdan anlamlı ve doğru bilgiler sunmada güçlü bir performans sergilemiştir. Bu bulgular, ilgili modellerin eğitim ve klinik karar verme süreçlerinde destekleyici araçlar olarak kullanılma potansiyelini ortaya koymaktadır. Copilot’un daha düşük performansı, bu modellerin sürekli olarak geliştirilmesi gerektiğini vurgulamaktadır. Klinik uygulamalarda yapay zekâ entegrasyonunun, uzman değerlendirmesi ve güncel ürolojik kılavuzlarla uyum içinde gerçekleştirilmesi önem arz etmektedir.
CR - 1. Tyagi S, Chancellor MB. Nocturnal polyuria and nocturia. Int Urol Nephrol 2023;55:1395-401. https://doi.org/10.1007/S11255-023-03582-5
CR - 2. Weiss JP, Everaert K. Management of Nocturia and Nocturnal Polyuria. Urology 2019;133:24-33. https://doi.org/10.1016/J.UROLOGY.2019.09.022
CR - 3. Lavadia AC, Kim JH, Yun SW, Noh TI. Nocturia, Sleep Quality, and Mortality: A Systematic Review and Meta-Analysis. World J Mens Health 2025;43. https://doi.org/10.5534/WJMH.240237
CR - 4. Oelke M, De Wachter S, Drake MJ, Giannantoni A, Kirby M, Orme S, et al. A practical approach to the management of nocturia. Int J Clin Pract 2017;71:e13027. https://doi.org/10.1111/ijcp.13027
CR - 5. ChatGPT version 4.0 [Internet]. OpenAI [cited 2025 Apr 18]. Available from: https://chatgpt.com/
CR - 6. Gemini 1.5 Pro [Internet]. Google DeepMind [cited 2025 Apr 18]. Available from: https://gemini.google.com/
CR - 7. Perplexity Pro [Internet]. Perplexity AI [cited 2025 Apr 18]. Available from: https://www.perplexity.ai/
CR - 8. Copilot (GPT-4-based) [Internet]. Microsoft [cited 2025 Apr 18]. Available from: https://copilot.microsoft.com/
CR - 9. Abd-Alrazaq A, AlSaad R, Alhuwail D, Ahmed A, Healy PM, Latifi S, et al. Large Language Models in Medical Education: Opportunities, Challenges, and Future Directions. JMIR Med Educ 2023;9:e48291. https://doi.org/10.2196/48291
CR - 10. Gupta R, Pedraza AM, Gorin MA, Tewari AK. Defining the Role of Large Language Models in Urologic Care and Research. Eur Urol Oncol 2024;7:1-13. https://doi.org/10.1016/j.euo.2023.07.017
CR - 11. Song H, Xia Y, Luo Z, Liu H, Song Y, Zeng X, et al. Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis. J Med Syst 2023;47:1-9. https://doi.org/10.1007/S10916-023-02021-3
CR - 12. Chervenak J, Lieman H, Blanco-Breindel M, Jindal S. The promise and peril of using a large language model to obtain clinical information: ChatGPT performs strongly as a fertility counseling tool with limitations. Fertil Steril 2023;120:575-83. https://doi.org/10.1016/J.FERTNSTERT.2023.05.151
CR - 13. Ferber D, Kather JN. Large Language Models in Urooncology. Eur Urol Oncol 2024;7:157-9. https://doi.org/10.1016/j.euo.2023.09.019
CR - 14. Şahin B, Genç YE, Doğan K, Şener TE, Şekerci ÇA, Tanıdır Y, et al. Evaluating the Performance of ChatGPT in Urology: A Comparative Study of Knowledge Interpretation and Patient Guidance. J Endourol 2024;38:799-808. https://doi.org/10.1089/END.2023.0413
CR - 15. Hacibey I, Halis A. Assessment of artificial intelligence performance in answering questions on onabotulinum toxin and sacral neuromodulation. Investig Clin Urol 2025;66:188-93. https://doi.org/10.4111/ICU.20250040
CR - 16. Joshi A, Kale S, Chandel S, Pal D. Likert scale: Explored and explained. Br J Appl Sci Technol 2015;7:396-403. https://doi.org/10.9734/BJAST/2015/14975
CR - 17. Koo TK, Li MY. A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. J Chiropr Med 2016;15:155-63. https://doi.org/10.1016/J.JCM.2016.02.012
CR - 18. Altıntaş E, Ozkent MS, Gül M, Batur AF, Kaynar M, Kılıç Ö, et al. Comparative analysis of artificial intelligence chatbot recommendations for urolithiasis management: A study of EAU guideline compliance. French J Urol 2024;34:102666. https://doi.org/10.1016/J.FJUROL.2024.102666
CR - 19. Zhu L, Mou W, Chen R. Can the ChatGPT and other large language models with internet-connected database solve the questions and concerns of patient with prostate cancer and help democratize medical knowledge? J Transl Med 2023;21:1-4. https://doi.org/10.1186/S12967-023-04123-5
CR - 20. Caglar U, Yildiz O, Meric A, Ayranci A, Gelmis M, Sarilar O, et al. Evaluating the performance of ChatGPT in answering questions related to pediatric urology. J Pediatr Urol 2024;20:26.e1-26.e5. https://doi.org/10.1016/J.JPUROL.2023.08.003
CR - 21. Xie Q, Chen Q, Chen A, Peng C, Hu Y, Lin F, et al. Medical foundation large language models for comprehensive text analysis and beyond. NPJ Digit Med 2025;8. https://doi.org/10.1038/S41746-025-01533-1
CR - 22. Rao A, Kim J, Kamineni M, Pang M, Lie W, Succi MD, et al. Evaluating ChatGPT as an Adjunct for Radiologic Decision-Making. medRxiv 2023:2023.02.02.23285399. https://doi.org/10.1101/2023.02.02.23285399
CR - 23. Lombardo R, Gallo G, Stira J, Turchi B, Santoro G, Riolo S, et al. Quality of information and appropriateness of Open AI outputs for prostate cancer. Prostate Cancer Prostatic Dis 2024;28:229-31. https://doi.org/10.1038/S41391-024-00789-0
CR - 24. Şahin MF, Ateş H, Keleş A, Özcan R, Doğan Ç, Akgül M, et al. Responses of Five Different Artificial Intelligence Chatbots to the Top Searched Queries About Erectile Dysfunction: A Comparative Analysis. J Med Syst 2024;48:1-6. https://doi.org/10.1007/S10916-024-02056-0
CR - 25. Shah M, Naik N, Somani BK, Hameed BMZ. Artificial intelligence (AI) in urology-Current use and future directions: An iTRUE study. Turk J Urol 2020;46:S27. https://doi.org/10.5152/TUD.2020.20117
CR - 26. de Hond AAH, Leeuwenberg AM, Hooft L, Kant IMJ, Nijman SWJ, van Os HJA, et al. Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review. NPJ Digit Med 2022;5:1-13. https://doi.org/10.1038/S41746-021-00549-7
CR - 27. Saraswat D, Bhattacharya P, Verma A, Prasad VK, Tanwar S, Sharma G, et al. Explainable AI for Healthcare 5.0: Opportunities and Challenges. IEEE Access 2022;10:84486-517. https://doi.org/10.1109/ACCESS.2022.3197671
CR - 28. Shen Y, Heacock L, Elias J, Hentel KD, Reig B, Shih G, et al. ChatGPT and Other Large Language Models Are Double-edged Swords. Radiology 2023;307:e230163. https://doi.org/10.1148/RADIOL.230163
CR - 29. Almada M, Petit N. The EU AI Act: Between the rock of product safety and the hard place of fundamental rights. Common Market Law Review 2025;62:85-120. https://doi.org/10.54648/COLA2025004
UR - https://doi.org/10.33719/nju1730282
L1 - https://dergipark.org.tr/tr/download/article-file/5004287
ER -
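The statistical workflow named in the abstract (a non-parametric repeated-measures comparison of the four models' scores, plus an intraclass correlation for the two raters) can be illustrated with a short Python sketch. This is a minimal illustration on hypothetical data, not the study's analysis: the simulated means are merely seeded from the reported values, and icc2_1 is a hypothetical helper implementing the Shrout-Fleiss ICC(2,1) estimator, since the paper does not state which ICC form was used.

import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(42)

# Hypothetical per-question mean scores (25 questions x 4 models), centered on
# the reported overall means for ChatGPT, Perplexity, Gemini, and Copilot.
scores = np.clip(rng.normal([4.61, 4.52, 4.35, 3.63], 0.30, size=(25, 4)), 1, 5)
stat, p = friedmanchisquare(*scores.T)  # non-parametric repeated-measures test
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` has shape (n_targets, k_raters)."""
    n, k = ratings.shape
    grand = ratings.mean()
    ms_r = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # targets
    ms_c = n * ((ratings.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # raters
    ss_e = ((ratings - grand) ** 2).sum() - ms_r * (n - 1) - ms_c * (k - 1)
    ms_e = ss_e / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Hypothetical ratings from the two blinded raters for all 100 responses.
latent = rng.uniform(3.0, 5.0, size=100)
raters = np.column_stack([latent + rng.normal(0, 0.2, 100) for _ in range(2)])
print(f"ICC(2,1) = {icc2_1(raters):.2f}")

With two raters scoring the same set of responses, an absolute-agreement single-rater ICC is one natural reading of the reported ICC = 0.91, but other forms (for example, the average-measures ICC(2,k)) are equally plausible.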