Do they learn when they read? A two-stage evaluation of AI models’ orthopedic knowledge using Orthobullets and Miller’s review

Mahircan Demir; İbrahim Faruk Adıgüzel; Mustafa Dinç; Recep Karasu

doi:10.47582/jompac.1824569

Research Article

Year 2025, Volume: 6 Issue: 6, 788 - 792, 27.12.2025

https://izlik.org/JA72KS32KH

Abstract

Introduction/Aim:
Large language models (LLMs) such as ChatGPT, Gemini, Claude, and Perplexity are increasingly used in medical education. However, their level of orthopedic knowledge and their capacity to learn from structured reference materials and improve remain unclear. This study aimed to compare the orthopedic knowledge performance of four advanced LLMs before and after exposure to a standard textbook, and to determine whether domain-specific educational content improves model performance.

Materials and Methods:
A two-stage evaluation was conducted using 110 multiple-choice orthopedic questions obtained from the Orthobullets platform. Each model was tested before and after access to Miller’s Review of Orthopaedics. Accuracy rates were recorded; the Wilcoxon signed-rank test was used for within-model comparisons and the Kruskal–Wallis test with Bonferroni correction for between-model comparisons. The primary outcome measure was the change in accuracy percentage after educational exposure.

Findings/Results:
All models showed a significant increase in accuracy after textbook exposure (p < 0.001). The largest increase was observed in Gemini (+20.9%), followed by Claude (+10.9%), Perplexity (+10.0%), and ChatGPT (+9.1%). Perplexity reached the highest overall post-intervention accuracy (90.0%), while Claude remained the lowest-performing model. Although Gemini’s increase was the largest, it did not reach statistical significance compared with the other models (p = 0.052).

Conclusion:
This study revealed marked differences among large language models in orthopedic knowledge level and learning capacity. Although accuracy increased in models supported with domain-specific reference material, the magnitude of this increase varied by model. The findings emphasize the need for model-based evaluation and a cautious approach when integrating LLMs into medical education and clinical decision support processes. Further studies involving larger datasets and real-world clinical tasks are needed.

Keywords

Artificial intelligence, machine learning, orthopedics, medical education, clinical decision support systems, large language models

Ethical Statement

This study did not involve human participants, patient data, or any biological material. Therefore, ethics committee approval was not required. The study was conducted using outputs generated by large language models (LLMs) and did not include any data subject to ethical review.

Supporting Institution

There is no supporting institution for this study.

Abstract

Aims: Large language models (LLMs) such as ChatGPT, Gemini, Claude, and Perplexity are increasingly incorporated into medical education; however, their baseline orthopedic knowledge and their ability to utilize structured reference materials remain insufficiently characterized. This study aimed to compare the performance of four advanced LLMs before and after exposure to a standardized orthopedic textbook and to determine whether domain-specific educational content enhances inference-time accuracy.
Methods: A two-stage evaluation was conducted using 110 multiple-choice questions from the Orthobullets platform. Each model first completed the question set under identical prompting conditions. A new chat session was then initiated, and the full PDF of Miller’s Review of Orthopaedics (9th edition) was uploaded using each platform’s native document-processing function. The models were then retested with the same questions. Pre–post accuracy differences were analyzed using the Wilcoxon signed-rank test (effect size r calculated as Z/√N). Between-model differences were assessed using the Kruskal–Wallis test with Bonferroni-adjusted pairwise comparisons. The primary outcome was the change in accuracy (%) after textbook exposure.
Results: All four models demonstrated significant improvement following access to the textbook (p<0.001). Gemini showed the greatest numerical gain (+20.9%), followed by Claude (+10.9%), Perplexity (+10.0%), and ChatGPT (+9.1%). Perplexity achieved the highest absolute post-exposure accuracy (90.0%), whereas Claude remained the lowest performer. Although Gemini exhibited the largest improvement, its advantage over the other models did not reach statistical significance (p=0.052).
Conclusion: Exposure to a standardized orthopedic textbook was associated with improved inference-time accuracy across all models, though the magnitude of benefit varied by platform. These findings underscore the heterogeneity of LLM performance in subspecialty medical topics and highlight the importance of model-specific benchmarking. Because LLMs do not undergo parameter-level learning during user interaction, observed improvements reflect temporary contextual integration rather than durable knowledge acquisition. Further research involving broader datasets, additional model architectures, and clinically oriented task evaluations is warranted.
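
As an illustration of the statistics named in the Methods, the following is a minimal Python sketch of how a pre–post accuracy comparison and the effect size r = Z/√N could be computed. It is not the authors' analysis code. The per-question data are hypothetical, the function names are invented for this example, Z is obtained from the normal approximation with a tie correction and no continuity correction, and N is assumed to be the number of paired questions (the abstract does not define it). The reading of the reported gains as percentage points out of 110 questions is likewise an inference: under that reading, 99 correct answers give 90.0% and 23 additional correct answers give +20.9.

```python
import math


def accuracy_pct(correct, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


def wilcoxon_z(pre, post):
    """Z statistic of the Wilcoxon signed-rank test (normal approximation).

    pre and post are paired scores, e.g. 0 = wrong and 1 = correct for each
    question. Zero differences are dropped, tied |differences| receive
    average ranks, and the variance is tie-corrected. No continuity
    correction is applied (an assumption of this sketch).
    """
    diffs = [b - a for a, b in zip(pre, post) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    return (w_plus - mean) / math.sqrt(var)


def effect_size_r(z, n_pairs):
    """Effect size r = Z / sqrt(N), with N taken as the number of pairs."""
    return z / math.sqrt(n_pairs)


# Hypothetical scores for 10 questions (not data from the study).
pre = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
post = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

z = wilcoxon_z(pre, post)           # about 1.342
r = effect_size_r(z, len(pre))      # about 0.424
print(accuracy_pct(sum(pre), len(pre)), accuracy_pct(sum(post), len(post)))
print(round(z, 3), round(r, 3))
```

With binary correct/incorrect scores every non-zero difference has the same magnitude, so all ranks are tied and Z reduces to (improved − worsened)/√(improved + worsened), which makes the result easy to check by hand.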

Keywords

Artificial intelligence, machine learning, orthopedics, medical education, large language models

Ethical Statement

This study did not involve human participants, patient data, or any biological material. Therefore, ethics committee approval was not required. The study was conducted using outputs generated by large language models (LLMs) and did not include any data subject to ethical review.

Supporting Institution

There is no supporting institution for this study.

Details

Primary Language	English
Subjects	Orthopaedics
Journal Section	Research Article
Authors	Mahircan Demir (ORCID 0000-0002-7372-3280); İbrahim Faruk Adıgüzel (ORCID 0000-0003-2493-5540); Mustafa Dinç (ORCID 0000-0002-3002-5028); Recep Karasu (ORCID 0000-0002-0628-5794)
Submission Date	November 15, 2025
Acceptance Date	December 22, 2025
Publication Date	December 27, 2025
DOI	https://doi.org/10.47582/jompac.1824569
IZ	https://izlik.org/JA72KS32KH
Published in Issue	Year 2025 Volume: 6 Issue: 6

Cite

AMA	Demir M, Adıgüzel İF, Dinç M, Karasu R. Do they learn when they read? A two-stage evaluation of AI models’ orthopedic knowledge using Orthobullets and Miller’s review. J Med Palliat Care (JOMPAC). 2025;6(6):788-792. doi:10.47582/jompac.1824569
