Evaluation of accuracy, clinical reliability and readability of LLM-based chatbot responses in prosthetic dentistry FAQs

Bülent Kadir Tartuk; Eyyüp Altıntaş

Abstract

Aims: Evidence comparing multiple contemporary large language model (LLM)-based chatbots in prosthetic dentistry using multidimensional outcome measures remains limited. This study compared the responses of ChatGPT, Gemini, Copilot and DeepSeek to frequently asked questions (FAQs) on prosthetic dentistry in terms of accuracy, clinical reliability and readability.

Methods: Thirty-nine FAQs obtained from publicly available patient education resources were distributed equally across fixed, removable and implant-supported prosthesis categories (n=13 each). The questions were submitted in Turkish on the same day under standardized conditions to ChatGPT, Gemini, Copilot and DeepSeek, each accessed through its publicly available web interface. The Turkish responses were independently scored by three prosthodontists on five-point Likert scales for accuracy and clinical reliability. Readability was assessed using the Ateşman and Bezirci-Yılmaz formulas. Inter-rater agreement was analyzed using the intraclass correlation coefficient (ICC). Repeated-measures comparisons were performed using the Friedman test, followed by Bonferroni-adjusted pairwise Wilcoxon signed-rank tests. Effect sizes were reported as Kendall’s W.

Results: Inter-rater agreement was high for accuracy (ICC=0.86) and clinical reliability (ICC=0.83). Significant differences between systems were observed in accuracy, clinical reliability and readability (all p<0.001; Kendall’s W=0.31-0.46). ChatGPT demonstrated the highest accuracy and the most favorable readability values, whereas Gemini showed the highest clinical reliability scores. Copilot and DeepSeek generally performed worse. Implant-related questions yielded significantly lower accuracy and reliability scores than fixed and removable prosthesis questions (p<0.05).

Conclusion: LLM-based chatbots performed heterogeneously in answering questions related to prosthetic dentistry. Although some systems may assist preliminary patient education, meaningful differences in clinical reliability and readability indicate that chatbot outputs should be interpreted cautiously and reviewed by dental professionals, particularly for implant-related topics.
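The Ateşman readability score named in the Methods can be illustrated with a short sketch. The abstract does not describe how the authors computed it, so everything below is an assumption: the coefficients follow the commonly cited form of Ateşman's formula (198.825 − 40.175 × syllables per word − 2.610 × words per sentence), syllables are approximated by counting vowels (each Turkish syllable contains one vowel), and the function names `count_syllables` and `atesman_score` are hypothetical. The Bezirci-Yılmaz formula is not covered here.

```python
import re

# Assumption: a Turkish syllable contains exactly one vowel, so the
# number of vowels in a word is used as its syllable count. This can
# miscount loanwords, abbreviations and numerals.
TURKISH_VOWELS = set("aeıioöuüâîûAEIİOÖUÜÂÎÛ")

# A word is a run of letters, optionally continued after an apostrophe
# (so a suffixed proper noun such as "Gemini'nin" counts as one word).
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def count_syllables(word: str) -> int:
    """Approximate the syllable count of a Turkish word by counting vowels."""
    return sum(1 for ch in word if ch in TURKISH_VOWELS)


def atesman_score(text: str) -> float:
    """Ateşman readability score; higher values indicate easier text."""
    words = WORD_RE.findall(text)
    # Naive sentence split on terminal punctuation; fragments without
    # any word (for example, trailing whitespace) are ignored.
    sentences = [s for s in re.split(r"[.!?…]+", text) if WORD_RE.search(s)]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    words_per_sentence = len(words) / len(sentences)
    return 198.825 - 40.175 * syllables_per_word - 2.610 * words_per_sentence
```

For example, `atesman_score("Diş implantı nedir? İmplant çene kemiğine yerleştirilir.")` evaluates a two-sentence, seven-word text with 19 vowels. The naive sentence splitter treats every period as a sentence boundary, so abbreviations and decimal numbers would need extra handling before this sketch could be applied to real chatbot output.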

Keywords

Supporting Institution

The authors received no financial support for the conduct or publication of this research.

Ethical Statement

This article does not require ethics committee approval as it does not involve human or animal studies.

Details

Primary Language

English

Subjects

Dental Public Health

Journal Section

Research Article

Authors

Bülent Kadir Tartuk*
0000-0003-2282-8944
Türkiye

Eyyüp Altıntaş
0000-0002-7767-9694
Türkiye

Publication Date

May 22, 2026

Submission Date

March 9, 2026

Acceptance Date

May 2, 2026

Published in Issue

Year 2026 Volume: 8 Number: 3

IZ

https://izlik.org/JA49FE48DR

Cite

APA

Tartuk, B. K., & Altıntaş, E. (2026). Evaluation of accuracy, clinical reliability and readability of LLM-based chatbot responses in prosthetic dentistry FAQs. Anatolian Current Medical Journal, 8(3), 513-523. https://izlik.org/JA49FE48DR

AMA

1. Tartuk BK, Altıntaş E. Evaluation of accuracy, clinical reliability and readability of LLM-based chatbot responses in prosthetic dentistry FAQs. Anatolian Curr Med J. 2026;8(3):513-523. https://izlik.org/JA49FE48DR

Chicago

Tartuk, Bülent Kadir, and Eyyüp Altıntaş. 2026. “Evaluation of Accuracy, Clinical Reliability and Readability of LLM-Based Chatbot Responses in Prosthetic Dentistry FAQs”. Anatolian Current Medical Journal 8 (3): 513-23. https://izlik.org/JA49FE48DR.

EndNote

Tartuk BK, Altıntaş E (May 1, 2026) Evaluation of accuracy, clinical reliability and readability of LLM-based chatbot responses in prosthetic dentistry FAQs. Anatolian Current Medical Journal 8 3 513–523.

IEEE

[1] B. K. Tartuk and E. Altıntaş, “Evaluation of accuracy, clinical reliability and readability of LLM-based chatbot responses in prosthetic dentistry FAQs”, Anatolian Curr Med J, vol. 8, no. 3, pp. 513–523, May 2026, [Online]. Available: https://izlik.org/JA49FE48DR

ISNAD

Tartuk, Bülent Kadir - Altıntaş, Eyyüp. “Evaluation of Accuracy, Clinical Reliability and Readability of LLM-Based Chatbot Responses in Prosthetic Dentistry FAQs”. Anatolian Current Medical Journal 8/3 (May 1, 2026): 513-523. https://izlik.org/JA49FE48DR.

JAMA

1. Tartuk BK, Altıntaş E. Evaluation of accuracy, clinical reliability and readability of LLM-based chatbot responses in prosthetic dentistry FAQs. Anatolian Curr Med J. 2026;8:513–523.

MLA

Tartuk, Bülent Kadir, and Eyyüp Altıntaş. “Evaluation of Accuracy, Clinical Reliability and Readability of LLM-Based Chatbot Responses in Prosthetic Dentistry FAQs”. Anatolian Current Medical Journal, vol. 8, no. 3, May 2026, pp. 513-23, https://izlik.org/JA49FE48DR.

Vancouver

1. Bülent Kadir Tartuk, Eyyüp Altıntaş. Evaluation of accuracy, clinical reliability and readability of LLM-based chatbot responses in prosthetic dentistry FAQs. Anatolian Curr Med J [Internet]. 2026 May 1;8(3):513-23. Available from: https://izlik.org/JA49FE48DR