Comparative Evaluation of Four Large Language Models in Turkish Dentistry Specialization Exam

Ömer Ekici

doi:10.15311/selcukdentj.1674113

Research Article

Evaluation of the Performance of Claude 3.5, GPT-3.5, Co-Pilot, and Gemini 1.5 on the Dentistry Specialization Training Entrance Exam

Year 2025, Volume: 12 Issue: 4, 6 - 10, 19.09.2025

Abstract

Aim
The aim of this study was to evaluate the performance of four leading large language models (LLMs) on the 2021 Dentistry Specialization Training Entrance Exam (DUS).
Materials and Methods
A total of 112 questions from the 2021 DUS that contained no figures or graphs were used: 39 in basic sciences and 73 in clinical sciences. The performance of four LLMs was evaluated: Claude 3.5 Haiku, GPT-3.5, Copilot, and Gemini 1.5.
Results
In basic sciences, Claude 3.5 Haiku and GPT-3.5 answered 100% of the questions correctly, while Gemini 1.5 answered 94.9% and Copilot 92.3% correctly. In clinical sciences overall, the correct answer rates were 89% for Claude 3.5 Haiku, 80.9% for Copilot, 79.7% for GPT-3.5, and 65.7% for Gemini 1.5. Across all questions, the correct answer rates were 92.85% for Claude 3.5 Haiku, 86.6% for GPT-3.5, 84.8% for Copilot, and 75.9% for Gemini 1.5. The LLMs performed similarly in basic sciences (p=0.134), whereas their performance differed significantly in clinical sciences and across all questions (p=0.007 and p=0.005, respectively).
Conclusion
Across all questions, Claude 3.5 Haiku performed best and Gemini 1.5 performed worst, while GPT-3.5 and Copilot performed similarly. All four LLMs achieved a higher success rate in basic sciences than in clinical sciences. The results showed that AI-based LLMs can perform well on knowledge-based questions such as those in basic sciences, but perform less well on questions that, like those in clinical sciences, require clinical reasoning, discussion, and interpretation in addition to knowledge.
Keywords
Artificial intelligence, Dentistry, Dentistry specialization training, Large language model.

Abstract

Background
This study aimed to evaluate the performance of four leading large language models (LLMs) on the 2021 Dentistry Specialization Training Exam (DSE).
Methods
A total of 112 questions from the 2021 DSE that contained no figures or graphs were used: 39 in basic sciences and 73 in clinical sciences. The study evaluated the performance of four LLMs: Claude-3.5 Haiku, GPT-3.5, Co-pilot, and Gemini-1.5.
Results
In basic sciences, Claude-3.5 Haiku and GPT-3.5 answered 100% of the questions correctly, while Gemini-1.5 answered 94.9% and Co-pilot 92.3% correctly. In clinical sciences overall, the correct answer rates were 89% for Claude-3.5 Haiku, 80.9% for Co-pilot, 79.7% for GPT-3.5, and 65.7% for Gemini-1.5. Across all questions, the correct answer rates were 92.85% for Claude-3.5 Haiku, 86.6% for GPT-3.5, 84.8% for Co-pilot, and 75.9% for Gemini-1.5. The LLMs performed similarly in basic sciences (p=0.134), whereas their performance differed significantly in clinical sciences and across all questions (p=0.007 and p=0.005, respectively).
Conclusion
Both across all questions and in clinical sciences, Claude-3.5 Haiku performed best and Gemini-1.5 performed worst, while GPT-3.5 and Co-pilot performed similarly. All four LLMs achieved a higher success rate in basic sciences than in clinical sciences. The results showed that AI-based LLMs can perform well on knowledge-based questions such as those in basic sciences, but perform less well on questions that, like those in clinical sciences, require clinical reasoning, discussion, and interpretation in addition to knowledge.
Keywords
Artificial intelligence, Dentistry, Dentistry specialization training, Large language model
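
The arithmetic behind the Results above can be sketched as follows. The abstract reports percentages and p-values but neither the raw correct-answer counts nor the name of the statistical test, so two things here are assumptions: the counts are inferred by rounding the reported percentages, and the comparison is run as a Pearson chi-square test on a 4 x 2 table (model by correct/incorrect). Clinical counts are taken as the all-questions count minus the basic-sciences count; for GPT-3.5 and Co-pilot this implies 58/73 and 59/73, which differ slightly from the reported 79.7% and 80.9%, so those two counts in particular should be treated as approximate. Under these assumptions the p-values come out close to the reported 0.134, 0.007, and 0.005.

```python
import math

# Question counts come from the abstract. The percentages are the reported
# correct-answer rates. Correct-answer counts are NOT reported in the abstract;
# they are inferred here by rounding (an assumption).
N_BASIC, N_ALL = 39, 112
N_CLINICAL = N_ALL - N_BASIC  # 73

MODELS = ["Claude-3.5 Haiku", "GPT-3.5", "Co-pilot", "Gemini-1.5"]
BASIC_PCT = [100.0, 100.0, 92.3, 94.9]
ALL_PCT = [92.85, 86.6, 84.8, 75.9]


def implied_counts(percentages, n_questions):
    """Round each reported percentage to the nearest whole number of questions."""
    return [round(p * n_questions / 100) for p in percentages]


def pearson_chi_square(correct, n_questions):
    """Pearson chi-square on a 4 x 2 table (model x correct/incorrect).

    Assumed test: the abstract does not name the one actually used.
    Returns (statistic, p_value). The p-value uses the closed-form upper tail
    of the chi-square distribution with 3 degrees of freedom, so exactly four
    models are required. Every model sees the same number of questions, so the
    expected count in each column is simply the column mean.
    """
    if len(correct) != 4:
        raise ValueError("closed-form p-value assumes 4 models (3 df)")
    expected_correct = sum(correct) / len(correct)
    expected_wrong = n_questions - expected_correct  # must be > 0
    stat = sum(
        (c - expected_correct) ** 2 / expected_correct
        + ((n_questions - c) - expected_wrong) ** 2 / expected_wrong
        for c in correct
    )
    p_value = (math.erfc(math.sqrt(stat / 2))
               + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2))
    return stat, p_value


basic = implied_counts(BASIC_PCT, N_BASIC)
overall = implied_counts(ALL_PCT, N_ALL)
clinical = [a - b for a, b in zip(overall, basic)]  # inferred, not reported

for label, counts, n in [("basic", basic, N_BASIC),
                         ("clinical", clinical, N_CLINICAL),
                         ("all", overall, N_ALL)]:
    stat, p = pearson_chi_square(counts, n)
    print(f"{label:8s} counts={dict(zip(MODELS, counts))} chi2={stat:.2f} p={p:.3f}")
```

Because all four models answered the same 112 questions, a paired test such as Cochran's Q would also be a reasonable choice, but it needs per-question results that the abstract does not provide.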

Ethical Statement

Since this study used only publicly available internet data and did not involve human participants, ethics committee approval was not required.

Details

Primary Language	English
Subjects	Dentistry (Other)
Journal Section	Research
Authors	Ömer Ekici 0000-0002-7902-9601
Publication Date	September 19, 2025
Submission Date	April 11, 2025
Acceptance Date	June 30, 2025
Published in Issue	Year 2025 Volume: 12 Issue: 4

Cite

Vancouver	Ekici Ö. Comparative Evaluation of Four Large Language Models in Turkish Dentistry Specialization Exam. Selcuk Dent J. 2025;12(4):6-10.

Selcuk Dental Journal is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY NC).