Benchmarking Different Natural Language Processing Models for Their Responses to Queries on Tooth-supported Fixed Dental Prostheses in Terms of Accuracy and Consistency

Emine Dilara Çolpak; Deniz Yılmaz

doi:10.54617/adoklinikbilimler.1698260

Research Article

Comparison of Different Natural Language Processing Models in Queries on Tooth-Supported Fixed Prostheses in Terms of Accuracy and Consistency

Year 2025, Volume: 14 Issue: 3, 215 - 223, 29.09.2025

Abstract

Aim: The aim of this study was to evaluate the accuracy and repeatability of responses on tooth-supported fixed dental prostheses generated by four different software programs.
Materials and Methods: Twelve open-ended questions were prepared in Turkish and posed to four different NLP models: OpenAI o3 (LRM-O), OpenAI GPT 4.5 (LLM-G), DeepSeek R1 (LRM-R), and DeepSeek V3 (LLM-V). The responses were evaluated using a holistic rubric. The Kruskal-Wallis H test was used for the accuracy assessments. Agreement between the raters' scores was assessed using the Brennan and Prediger coefficient and the Cohen kappa coefficient. Repeatability was assessed using the Fleiss kappa and Krippendorff alpha coefficients (p < .05).
Results: No statistically significant difference in accuracy was found among the LRM-O, LLM-G, LRM-R, and LLM-V groups (p = .298). The accuracy of LRM-O, LLM-G, LRM-R, and LLM-V was 77.7%, 50%, 66.6%, and 77.7%, respectively. In addition, the repeatability of the LLMs was found to be almost perfect, whereas that of the LRMs was substantial.
Conclusion: Within the limitations of the study, LRMs and LLMs showed similar accuracy. However, the repeatability of the LLMs was higher than that of the LRMs.
Keywords: Artificial intelligence, dental prostheses, treatment protocols

References

1. Kaygisiz ÖF, Teke MT. Can DeepSeek and ChatGPT be used in the diagnosis of oral pathologies? BMC Oral Health 2025;25:638.
2. Stroop A, Stroop T, Zawy Alsofy S, Wegner M, Nakamura M, Stroop R. Assessing GPT-4’s accuracy in answering clinical pharmacological questions on pain therapy. Br J Clin Pharmacol 2025;2025:1–10.
3. Kambhampati S, Stechly K, Valmeekam K. (How) Do reasoning models reason? Ann N Y Acad Sci 2025;1547:33–40.
4. Haupt CE, Marks M. AI-generated medical advice: GPT and beyond. JAMA 2023;329:1349–50.
5. Gibney E. China’s cheap, open AI model DeepSeek thrills scientists. Nature 2025;638:13–4.
6. Hoyt RE, Knight D, Haider M, Bajwa M. Evaluating a large reasoning models performance on open-ended medical scenarios. medRxiv 2025;2025:1–15.
7. Jiang Q, Gao Z, Karniadakis GE. DeepSeek vs. ChatGPT vs. Claude: a comparative study for scientific computing and scientific machine learning tasks. Theoretical and Applied Mechanics Letters 2025;15:100583.
8. OpenAI. Introducing OpenAI o3 and o4-mini 2025. https://openai.com/index/introducing-o3-and-o4-mini/. Accessed 13 May 2025.
9. Egger J, De Paiva LF, Luijten G, Krittanawong C, Keyl J, Sallam M, et al. Is DeepSeek-R1 a game changer in healthcare? a seed review. TechRxiv 2025;4:1–21.
10. Sallam M, Al-Mahzoum K, Sallam M, Mijwil MM. DeepSeek: is it the end of generative AI monopoly or the mark of the impending doomsday? Mesopotamian Journal of Big Data 2025;2025:26–34.
11. Normile D. Chinese firm’s large language model makes a splash. Science 2025;387:238.
12. Eggmann F, Weiger R, Zitzmann NU, Blatz MB. Implications of large language models such as ChatGPT for dental medicine. J Esthet Restor Dent 2023;35:1098–1102.
13. Özcivelek T, Özcan B. Comparative evaluation of responses from DeepSeek-R1, ChatGPT-o1, ChatGPT-4, and dental GPT chatbots to patient inquiries about dental and maxillofacial prostheses. BMC Oral Health 2025;25:871.
14. Cuevas-Nunez M, Silberberg VIA, Arregui M, Jham B, Ballester-Victoria R, Koptseva I, et al. Diagnostic performance of ChatGPT-4.0 in histopathological description analysis of oral and maxillofacial lesions: a comparative study with pathologists. Oral Surg Oral Med Oral Pathol Oral Radiol 2025;139:453–61.
15. Shirani M. Comparing the performance of ChatGPT 4o, DeepSeek R1, and Gemini 2 Pro in answering fixed prosthodontics questions over time. J Prosthet Dent 2025;S0022-3913:00400-7.
16. Diniz-Freitas M, Diz-Dios P. DeepSeek: another step forward in the diagnosis of oral lesions. J Dent Sci Epub 2025.
17. Zhou M, Pan Y, Zhang Y, Song X, Zhou Y. Evaluating AI-generated patient education materials for spinal surgeries: comparative analysis of readability and DISCERN quality across ChatGPT and DeepSeek models. Int J Med Inform 2025;198:105871.
18. Hou Y, Patel J, Dai L, Zhang E, Liu Y, Zhan Z, et al. Benchmarking of Large Language Models for the Dental Admission Test. Health Data Sci 2025;5:0250.
19. British Society for Restorative Dentistry. Crowns, fixed bridges and dental implants: guidelines. Woodford Green: British Society for Restorative Dentistry; 2013. p. 8–21.
20. Zaghir J, Naguib M, Bjelogrlic M, Névéol A, Tannier X, Lovis C. Prompt engineering paradigms for medical applications: scoping review. J Med Internet Res 2024;26:e60501.
21. Koçak D. Investigation of rater tendencies and reliability in different assessment methods with the many facet Rasch model. IEJEE 2020;12:349–58.
22. Suárez A, Díaz-Flores García V, Algar J, Gómez Sánchez M, Llorente de Pedro M, Freire Y. Unveiling the ChatGPT phenomenon: evaluating the consistency and accuracy of endodontic question answers. Int Endod J 2024;57:108–13.
23. Suárez A, Jiménez J, Llorente de Pedro M, Andreu-Vázquez C, Díaz-Flores García V, Gómez Sánchez M, et al. Beyond the scalpel: assessing ChatGPT’s potential as an auxiliary intelligent virtual assistant in oral surgery. Comput Struct Biotechnol J 2023;24:46–52.
24. Gwet KL. Handbook of inter-rater reliability: the definitive guide to measuring the extent of agreement among raters. 4th ed. Gaithersburg: Advanced Analytics LLC; 2014. p. 163–183.
25. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna 2021. https://www.R-project.org. Accessed 13 May 2025.
26. Freire Y, Santamaría Laorden A, Orejas Pérez J, Gómez Sánchez M, Díaz-Flores García V, Suárez A. ChatGPT performance in prosthodontics: assessment of accuracy and repeatability in answer generation. J Prosthet Dent 2024;131:659.e1–6.
27. Cinar C. Analyzing the performance of ChatGPT about osteoporosis. Cureus 2023;15:e45890.
28. Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, et al. Chain-of-thought prompting elicits reasoning in large language models. Adv Neural Inf Process Syst 2022;35:24824–37.
29. Wang X, Wei J, Schuurmans D, Le Q, Chi E, Narang S, et al. Self-consistency improves chain-of-thought reasoning in language models. ArXiv 2023;1–24.
30. Gheisarifar M, Shembesh M, Koseoglu M, Fang Q, Afshari FS, Yuan JC, Sukotjo C. Evaluating the validity and consistency of artificial intelligence chatbots in responding to patients’ frequently asked questions in prosthodontics. J Prosthet Dent 2025;134:199–206.
31. Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus 2023;15:e35179.

Abstract

Aim: This study aimed to evaluate the accuracy and repeatability of responses generated by four different software programs regarding tooth-supported fixed dental prostheses.
Materials and Methods: Twelve open-ended questions were prepared in Turkish and posed, with pre-prompts, to four natural language processing (NLP) models in the morning, afternoon, and evening: OpenAI o3 (LRM-O), OpenAI GPT 4.5 (LLM-G), DeepSeek R1 (LRM-R), and DeepSeek V3 (LLM-V). The responses were evaluated with a holistic rubric. Accuracy was compared among the models with the Kruskal–Wallis H test. Agreement between the graders’ scores was assessed using the Brennan and Prediger coefficient and the Cohen kappa coefficient. Repeatability was assessed using the Fleiss kappa and Krippendorff alpha coefficients (p < 0.05).
Results: Accuracy did not differ significantly among the LRM-O, LLM-G, LRM-R, and LLM-V groups (p = 0.298); the respective accuracies were 77.7%, 50%, 66.6%, and 77.7%. The repeatability of the large language models (LLMs) was almost perfect, whereas that of the large reasoning models (LRMs) was substantial.
Conclusion: Within the limitations of the study, LRMs and LLMs exhibited similar accuracy. However, the repeatability of LLMs was higher than that of LRMs.
Keywords: Artificial intelligence, Dental prostheses, Treatment protocols
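
For readers unfamiliar with the agreement coefficients named in the abstract, the following is a minimal Python sketch of how the Cohen kappa, the Brennan and Prediger coefficient, and the Fleiss kappa are commonly computed. It is not the authors' analysis (the reference list points to R for that). The three-level score coding (0, 1, 2), the example scores, and all function and variable names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def observed_agreement(r1, r2):
    """Share of items on which two raters gave the same score."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("both raters must score the same non-empty item set")
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohen_kappa(r1, r2):
    """Cohen kappa: chance agreement taken from each rater's own marginals."""
    n = len(r1)
    po = observed_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (po - pe) / (1 - pe)


def brennan_prediger(r1, r2, n_categories):
    """Brennan and Prediger coefficient: chance agreement fixed at 1/q."""
    po = observed_agreement(r1, r2)
    pe = 1 / n_categories
    return (po - pe) / (1 - pe)


def fleiss_kappa(table):
    """Fleiss kappa for a table where table[i][j] counts how many of the
    repeated ratings placed item i in category j (same total per item)."""
    n_items = len(table)
    n_ratings = sum(table[0])
    n_cat = len(table[0])
    # Per-item agreement among the repeated ratings.
    p_i = [
        (sum(c * c for c in row) - n_ratings) / (n_ratings * (n_ratings - 1))
        for row in table
    ]
    p_bar = sum(p_i) / n_items
    # Overall share of ratings falling in each category.
    p_j = [sum(row[j] for row in table) / (n_items * n_ratings)
           for j in range(n_cat)]
    pe = sum(p * p for p in p_j)
    return (p_bar - pe) / (1 - pe)


# Hypothetical scores from two graders for 12 answers (0, 1, or 2).
rater_a = [2, 2, 1, 0, 2, 1, 2, 0, 2, 1, 2, 2]
rater_b = [2, 2, 1, 0, 2, 2, 2, 0, 2, 1, 1, 2]

# Hypothetical repeatability table: 4 questions, 3 repeated runs each,
# columns are the counts of runs scored 0, 1, and 2.
runs_table = [[0, 0, 3], [0, 1, 2], [3, 0, 0], [0, 3, 0]]

print(round(cohen_kappa(rater_a, rater_b), 3))          # 0.707
print(round(brennan_prediger(rater_a, rater_b, 3), 3))  # 0.75
print(round(fleiss_kappa(runs_table), 3))               # 0.745
```

All three coefficients share the form (observed agreement minus chance agreement) divided by (1 minus chance agreement) and differ only in how chance agreement is estimated. Labels such as "substantial" and "almost perfect" are conventionally attached to values of roughly 0.61 to 0.80 and above 0.80, respectively.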

There are 31 citations in total.

Details

Primary Language	English
Subjects	Prosthodontics
Journal Section	Research Article
Authors	Emine Dilara Çolpak (ORCID 0000-0002-5334-2421); Deniz Yılmaz (ORCID 0000-0003-4570-9067)
Publication Date	September 29, 2025
Submission Date	May 13, 2025
Acceptance Date	August 2, 2025
Published in Issue	Year 2025 Volume: 14 Issue: 3

Cite

Vancouver	Çolpak ED, Yılmaz D. Benchmarking Different Natural Language Processing Models for Their Responses to Queries on Tooth-supported Fixed Dental Prostheses in Terms of Accuracy and Consistency. ADO Klinik Bilimler Dergisi. 2025;14(3):215-23.
