Evaluation of the Competency of Large Language Models GPT-4o and Claude 3.5 Sonnet in Endodontic Emergencies

Merve Sarı; Pelin Tufenkci

doi:10.52037/eads.2025.0002

Research Article

Year 2025, Volume: 52 Issue: 1, 10 - 16, 30.04.2025

Abstract

Purpose: This study aimed to evaluate the accuracy and comprehensiveness of the responses generated by GPT-4o and Claude 3.5 Sonnet to the most frequently asked questions about endodontic emergencies.
Materials and Methods: The most frequently asked questions on nine topics in endodontics (inferior alveolar nerve block, sodium hypochlorite accidents, aspiration of dental materials, separated instruments, perforation, transportation, Ca(OH)2 extrusion, root filling, and flare-up) were generated by GPT-3.5. Each question was posed to both GPT-4o and Claude 3.5 Sonnet. Two authors independently scored the responses. Accuracy and comprehensiveness were assessed for each question using Likert scales. The data were statistically analyzed using the Mann-Whitney U and Kruskal-Wallis tests. The significance level was set at 0.05.
Results: Responses generated by GPT-4o and Claude 3.5 Sonnet to a total of 81 open-ended questions were evaluated. The two models yielded similar results in terms of accuracy and comprehensiveness (p > 0.05). For GPT-4o, root filling, perforation, and flare-up had the lowest accuracy scores, and root filling and separated instruments had the lowest comprehensiveness scores (p < 0.05). The accuracy of Claude 3.5 Sonnet's responses did not differ significantly between topics (p > 0.05); however, separated instruments had the lowest comprehensiveness scores (p < 0.05).
Conclusion: The accuracy and comprehensiveness scores of GPT-4o and Claude 3.5 Sonnet are statistically similar. Despite the high levels of accuracy and comprehensiveness shown by both models, they cannot yet replace the operator in endodontic procedures.
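
The two rank-based tests named in the Materials and Methods can be illustrated with a minimal sketch. It assumes Likert scores are stored as plain lists of integers; the function names and the example scores are hypothetical and are not the study's data or software. Only the test statistics are computed here; in practice the p-values compared against the 0.05 threshold would usually come from a statistics package.

```python
from collections import Counter
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks for `values`, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def mann_whitney_u(x, y):
    """Mann-Whitney U statistic of sample `x` against sample `y` (two models)."""
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[: len(x)])
    return rank_sum_x - len(x) * (len(x) + 1) / 2


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic across several groups (topics), tie-corrected."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, offset = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[offset : offset + len(group)])
        total += rank_sum * rank_sum / len(group)
        offset += len(group)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over each set of tied values.
    tie_sum = sum(t**3 - t for t in Counter(pooled).values())
    correction = 1 - tie_sum / (n**3 - n)
    if correction == 0:
        raise ValueError("all scores are identical; H is undefined")
    return h / correction


if __name__ == "__main__":
    # Hypothetical 5-point Likert scores, invented purely for illustration.
    model_a = [5, 4, 5, 3, 4]
    model_b = [4, 4, 5, 5, 3]
    print("U (model A vs. model B):", mann_whitney_u(model_a, model_b))

    # Hypothetical scores for one model grouped by three topics.
    by_topic = [[5, 5, 4], [3, 4, 3], [4, 5, 5]]
    print("H (across topics):", kruskal_wallis_h(by_topic))
```

Rank-based tests suit this kind of data because Likert scores are ordinal: the tests compare the ordering of responses without assuming the scores are normally distributed.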

Keywords

Artificial intelligence, Claude 3.5, endodontic emergencies, GPT-4o

Ethical Statement

Not applicable.

Supporting Institution

Not applicable.

Thanks

Not applicable.

Details

Primary Language	English
Subjects	Endodontics
Journal Section	Research Article
Authors	Merve Sarı (ORCID 0000-0002-9432-3809), Pelin Tufenkci (ORCID 0000-0001-9881-5395)
Submission Date	November 4, 2024
Acceptance Date	March 18, 2025
Early Pub Date	April 30, 2025
Publication Date	April 30, 2025
Published in Issue	Year 2025 Volume: 52 Issue: 1

Cite

Vancouver	Sarı M, Tufenkci P. Evaluation of the Competency of Large Language Models GPT-4o and Claude 3.5 Sonnet in Endodontic Emergencies. EADS. 2025;52(1):10-6.
