EXPERT EVALUATION OF ARTIFICIAL INTELLIGENCE GENERATED ANSWERS TO FREQUENTLY ASKED QUESTIONS ABOUT RHINOPLASTY

Serkan Şerifler; Burak Çelik; Kadir Şinasi Bulut; Fatih Gul; Kazım Bozdemir; Mehmet Ali Babademez

doi:10.54005/geneltip.1756002

Abstract

Aim: Large language models (LLMs) such as ChatGPT-4, DeepSeek, and Gemini are increasingly explored as tools for patient education and clinical decision support. However, concerns remain about their factual accuracy, completeness, and readability, especially when they answer frequently asked patient questions in postoperative care. The primary aim of this study was to compare three leading AI models (ChatGPT-4, DeepSeek, and Gemini) on the accuracy, clarity, relevance, and completeness of their answers to common postoperative rhinoplasty FAQs. The secondary aim was to assess the readability of these AI-generated responses for a general patient audience.

Method: Fourteen frequently asked questions were selected based on guidelines from the American Academy of Otolaryngology–Head and Neck Surgery (AAO-HNS). Responses from each AI model were independently evaluated by 15 board-certified otorhinolaryngologists using a 5-point Likert scale across four domains: accuracy, clarity, relevance, and completeness. Readability was measured using the Flesch Reading Ease Score (FRES) and the Flesch–Kincaid Grade Level (FKGL). Appropriate statistical tests were applied to identify significant differences among the models.

Results: Expert evaluations showed significant performance differences among the models. DeepSeek scored significantly lower than ChatGPT-4 and Gemini in both accuracy (p=0.00003) and completeness (p=0.0042). No statistically significant differences were observed for clarity (p=0.52) or relevance (p=0.42). Although readability scores did not differ significantly across models, all responses were too complex for the average patient to fully understand.

Conclusion: ChatGPT-4 and Gemini demonstrated higher accuracy and completeness than DeepSeek, but none of the evaluated AI models produced content that met essential patient readability standards. These findings underscore the need for more accessible content and continued human oversight before LLMs can be reliably integrated into clinical patient education. This study provides an important benchmark and indicates that future AI development should prioritize patient comprehension as much as factual accuracy.
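
The two readability measures named in the Method can be illustrated with a short sketch. The coefficients below are the standard published Flesch Reading Ease and Flesch–Kincaid Grade Level formulas; everything else is an assumption made for illustration. The abstract does not say which software the authors used, so the helper names (`text_counts`, `count_syllables`), the regex-based tokenization, and the vowel-group syllable heuristic are hypothetical and will not reproduce the scores of a validated readability tool. The sample sentence is invented and is not taken from the study.

```python
import re


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease Score: higher values mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade needed."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word):
    """Crude heuristic (an assumption): one syllable per vowel group, minimum 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_counts(text):
    """Return (words, sentences, syllables) using simple regex tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


if __name__ == "__main__":
    # Invented example sentence, not a response evaluated in the study.
    sample = "Keep your head elevated. Avoid blowing your nose for one week."
    w, s, syl = text_counts(sample)
    print("FRES:", round(flesch_reading_ease(w, s, syl), 2))
    print("FKGL:", round(flesch_kincaid_grade(w, s, syl), 2))
```

Both formulas depend only on average sentence length and average syllables per word, so longer sentences and longer words lower the Reading Ease score and raise the grade level.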

Keywords

Artificial Intelligence, Large Language Models, Rhinoplasty, Postoperative Care, Comprehension

Details

Primary Language

English

Subjects

Otorhinolaryngology

Journal Section

Clinical Research

Authors

Serkan Şerifler*
0000-0003-0771-7373
Türkiye

Burak Çelik
0000-0002-4708-9749
Türkiye

Kadir Şinasi Bulut
0000-0002-4145-8339
Türkiye

Fatih Gul
0000-0001-7992-0974
Türkiye

Kazım Bozdemir
0000-0001-9190-2293
Türkiye

Mehmet Ali Babademez
0000-0002-0020-6493
Türkiye

Publication Date

February 11, 2026

Submission Date

August 1, 2025

Acceptance Date

September 23, 2025

Published in Issue

Year 2026 Volume: 36 Number: 2026

DOI

https://doi.org/10.54005/geneltip.1756002

IZ

https://izlik.org/JA77WL27UG

APA

Şerifler, S., Çelik, B., Bulut, K. Ş., Gul, F., Bozdemir, K., & Babademez, M. A. (2026). EXPERT EVALUATION OF ARTIFICIAL INTELLIGENCE GENERATED ANSWERS TO FREQUENTLY ASKED QUESTIONS ABOUT RHINOPLASTY. Genel Tıp Dergisi, 36(2026), 1-5. https://doi.org/10.54005/geneltip.1756002

AMA

1. Şerifler S, Çelik B, Bulut KŞ, Gul F, Bozdemir K, Babademez MA. EXPERT EVALUATION OF ARTIFICIAL INTELLIGENCE GENERATED ANSWERS TO FREQUENTLY ASKED QUESTIONS ABOUT RHINOPLASTY. Genel Tıp Derg. 2026;36(2026):1-5. doi:10.54005/geneltip.1756002

Chicago

Şerifler, Serkan, Burak Çelik, Kadir Şinasi Bulut, Fatih Gul, Kazım Bozdemir, and Mehmet Ali Babademez. 2026. “EXPERT EVALUATION OF ARTIFICIAL INTELLIGENCE GENERATED ANSWERS TO FREQUENTLY ASKED QUESTIONS ABOUT RHINOPLASTY”. Genel Tıp Dergisi 36 (2026): 1-5. https://doi.org/10.54005/geneltip.1756002.

EndNote

Şerifler S, Çelik B, Bulut KŞ, Gul F, Bozdemir K, Babademez MA (February 1, 2026) EXPERT EVALUATION OF ARTIFICIAL INTELLIGENCE GENERATED ANSWERS TO FREQUENTLY ASKED QUESTIONS ABOUT RHINOPLASTY. Genel Tıp Dergisi 36 2026 1–5.

IEEE

[1] S. Şerifler, B. Çelik, K. Ş. Bulut, F. Gul, K. Bozdemir, and M. A. Babademez, “EXPERT EVALUATION OF ARTIFICIAL INTELLIGENCE GENERATED ANSWERS TO FREQUENTLY ASKED QUESTIONS ABOUT RHINOPLASTY”, Genel Tıp Derg, vol. 36, no. 2026, pp. 1–5, Feb. 2026, doi: 10.54005/geneltip.1756002.

ISNAD

Şerifler, Serkan - Çelik, Burak - Bulut, Kadir Şinasi - Gul, Fatih - Bozdemir, Kazım - Babademez, Mehmet Ali. “EXPERT EVALUATION OF ARTIFICIAL INTELLIGENCE GENERATED ANSWERS TO FREQUENTLY ASKED QUESTIONS ABOUT RHINOPLASTY”. Genel Tıp Dergisi 36/2026 (February 1, 2026): 1-5. https://doi.org/10.54005/geneltip.1756002.

JAMA

1. Şerifler S, Çelik B, Bulut KŞ, Gul F, Bozdemir K, Babademez MA. EXPERT EVALUATION OF ARTIFICIAL INTELLIGENCE GENERATED ANSWERS TO FREQUENTLY ASKED QUESTIONS ABOUT RHINOPLASTY. Genel Tıp Derg. 2026;36:1–5.

MLA

Şerifler, Serkan, et al. “EXPERT EVALUATION OF ARTIFICIAL INTELLIGENCE GENERATED ANSWERS TO FREQUENTLY ASKED QUESTIONS ABOUT RHINOPLASTY”. Genel Tıp Dergisi, vol. 36, no. 2026, Feb. 2026, pp. 1-5, doi:10.54005/geneltip.1756002.

Vancouver

1. Serkan Şerifler, Burak Çelik, Kadir Şinasi Bulut, Fatih Gul, Kazım Bozdemir, Mehmet Ali Babademez. EXPERT EVALUATION OF ARTIFICIAL INTELLIGENCE GENERATED ANSWERS TO FREQUENTLY ASKED QUESTIONS ABOUT RHINOPLASTY. Genel Tıp Derg. 2026 Feb. 1;36(2026):1-5. doi:10.54005/geneltip.1756002

The Journal of General Medicine is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY NC).
