Large Language Models’ Responses to Patient Questions on Lateral Epicondylitis: Multi-Institutional Orthopaedic Surgeon Evaluation

Ali Geçer; Emre Kaya; Alper Şükrü Kendirci; Alp Paksoy; Doruk Akgün

doi:10.47482/acmr.1778992


Large Language Models’ Responses to Patient Questions on Lateral Epicondylitis: A Multicenter Evaluation

Abstract

Aim: Lateral epicondylitis (tennis elbow) is a common cause of elbow pain. With the growing use of the internet and artificial intelligence (AI) to obtain health information, large language models (LLMs) have become sources that patients frequently consult. This study aimed to evaluate the responses of ChatGPT-3.5, ChatGPT-4, Gemini, and Copilot to frequently asked patient questions about lateral epicondylitis in terms of accuracy, reliability, content quality, and readability. Methods: The author committee screened patient questions about lateral epicondylitis from various websites using the Google search engine and included the 12 most frequently asked questions in the study. These questions were posed to four AI models (ChatGPT-3.5, ChatGPT-4, Gemini, and Copilot). The models’ responses were evaluated using four measures: accuracy (five-point Likert scale), reliability (modified DISCERN scale), quality (Global Quality Scale [GQS]), and readability (Flesch Reading Ease Score [FRES]). Results: ChatGPT-3.5 responses had the highest mean Likert score (4.11±0.24), followed by Gemini (4.11±0.17), Copilot (4.05±0.23), and ChatGPT-4 (3.95±0.21). The difference in Likert scores among the chatbot models was not statistically significant (p>0.05). Copilot achieved the highest modified DISCERN score (3.51±0.25), followed by Gemini (3.36±0.40), ChatGPT-3.5 (3.19±0.19), and ChatGPT-4 (2.93±0.19). Copilot and Gemini scored significantly higher than ChatGPT-4 (p<0.05). For GQS, the highest score belonged to ChatGPT-3.5 (3.86±0.20), followed by Gemini (3.80±4.33), Copilot (3.59±0.27), and ChatGPT-4 (3.40±0.22). ChatGPT-3.5 and Gemini scored significantly higher than ChatGPT-4 (p<0.05). In the GQS assessment, the proportion of high-quality responses was 33% for Gemini, 25% for ChatGPT-3.5, 8% for Copilot, and 0% for ChatGPT-4. Mean FRES values were 47.71±19.78 for Gemini, 43.17±10.91 for Copilot, 37.72±15.73 for ChatGPT-4, and 29.73±16.03 for ChatGPT-3.5. These values indicate that the responses were generally at a “difficult to read” level. Conclusion: All chatbot models gave generally accurate, good-quality answers to questions about tennis elbow. Copilot and Gemini provided the most reliable responses, while Gemini and ChatGPT-3.5 delivered the highest content quality. However, an important limitation of ChatGPT-3.5 and ChatGPT-4 was that they did not provide sources or citations in their responses. In addition, the responses of all models were found to be difficult to read.

Keywords

tennis, elbow, pain, sports injury, large language models, artificial intelligence, tendinitis

Ethical Statement

The authors declare that there is no conflict of interest related to this study.

Large Language Models’ Responses to Patient Questions on Lateral Epicondylitis: Multi-Institutional Orthopaedic Surgeon Evaluation

Abstract

Background: Lateral epicondylitis (tennis elbow) is a common cause of elbow pain. With the increasing use of the internet and artificial intelligence (AI) for health information, large language models (LLMs) are frequently consulted by patients. This study aimed to evaluate the accuracy, reliability, content quality, and readability of responses provided by different large language models (ChatGPT-3.5, ChatGPT-4, Gemini, and Copilot) to frequently asked patient questions about lateral epicondylitis. Methods: The author committee reviewed patient-oriented questions on lateral epicondylitis using Google searches and selected the 12 most frequently asked questions for inclusion. These questions were presented to four LLMs: ChatGPT-3.5, ChatGPT-4, Gemini, and Copilot. Responses were evaluated for accuracy using a five-point Likert scale, reliability using the modified DISCERN scale, quality using the Global Quality Scale (GQS), and readability using the Flesch Reading Ease Score (FRES). Results: Perceived medical accuracy did not differ significantly among the LLMs (p = 0.579). Reliability differed significantly (modified DISCERN: p < 0.001), with Copilot and Gemini achieving higher scores than ChatGPT-4 (both p < 0.001) and Copilot also outperforming ChatGPT-3.5 (p = 0.002). Quality differed significantly (GQS: p < 0.001), with ChatGPT-3.5 and Gemini scoring higher than ChatGPT-4 (p = 0.001 and p = 0.006, respectively). Readability differed across models (FRES: p = 0.049); Gemini demonstrated higher readability than ChatGPT-3.5 (p = 0.040), while responses from all models were generally difficult to read. Response generation time differed significantly (p < 0.001), with ChatGPT-4 producing the slowest responses. Conclusions: All evaluated LLMs provided generally accurate and moderately reliable responses to questions about tennis elbow, with differences observed across specific quality domains such as source transparency, readability, and response time. 
Models with citation capabilities demonstrated higher reliability in terms of source transparency, while readability remained a common limitation. LLMs show potential as supplementary patient information tools in orthopaedics; however, further refinement and improved readability are needed before widespread clinical use.
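
The readability measure named in the Methods, the Flesch Reading Ease Score, follows a fixed formula: FRES = 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The Python sketch below illustrates that formula only; it is not the authors' scoring procedure, which the abstract does not describe. The function names, the vowel-group syllable counter, and the band labels (taken from the commonly cited Flesch interpretation table) are assumptions introduced here for illustration.

```python
import re


def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Standard Flesch Reading Ease formula for English text."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("text must contain at least one word and one sentence")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fres_from_text(text: str) -> float:
    """Estimate FRES from raw text using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_reading_ease(len(words), len(sentences), syllables)


def readability_band(score: float) -> str:
    """Map a score to the commonly cited Flesch bands (assumed cutoffs)."""
    bands = [
        (90, "very easy"),
        (80, "easy"),
        (70, "fairly easy"),
        (60, "standard"),
        (50, "fairly difficult"),
        (30, "difficult"),
    ]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "very difficult"


if __name__ == "__main__":
    sample = "Tennis elbow causes pain. Rest helps."
    score = fres_from_text(sample)
    print(round(score, 2), readability_band(score))
```

Dedicated readability tools use more careful syllable and sentence detection, so scores from this heuristic may differ somewhat from those a published calculator would return for the same text.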

Keywords

tennis, elbow, pain, sports injuries, large language models, artificial intelligence, tendinitis

Supporting Institution

The authors did not receive any financial support for the submitted work.

Ethical Statement

The authors declare that they have no conflict of interest related to this study.

Acknowledgments

Not applicable.


Details

Primary Language

English

Subjects

Orthopaedics, Sports Medicine

Section

Research Article

Authors

Ali Geçer*
0000-0002-9807-0968
Türkiye

Emre Kaya
0000-0002-9493-8790
Türkiye

Alper Şükrü Kendirci
0000-0001-6250-2469
Türkiye

Alp Paksoy
0000-0002-1657-8961
Germany

Doruk Akgün
0000-0002-5958-4472
Germany

Publication Date

June 2, 2026

Submission Date

September 10, 2025

Acceptance Date

January 20, 2026

Published in Issue

Year 2026, Volume 7, Issue 2

DOI

https://doi.org/10.47482/acmr.1778992

IZ

https://izlik.org/JA86RK56DB

APA

Geçer, A., Kaya, E., Kendirci, A. Ş., Paksoy, A., & Akgün, D. (2026). Large Language Models’ Responses to Patient Questions on Lateral Epicondylitis: Multi-Institutional Orthopaedic Surgeon Evaluation. Archives of Current Medical Research, 7(2), 321-330. https://doi.org/10.47482/acmr.1778992

AMA

1. Geçer A, Kaya E, Kendirci AŞ, Paksoy A, Akgün D. Large Language Models’ Responses to Patient Questions on Lateral Epicondylitis: Multi-Institutional Orthopaedic Surgeon Evaluation. Arch Curr Med Res. 2026;7(2):321-330. doi:10.47482/acmr.1778992

Chicago

Geçer, Ali, Emre Kaya, Alper Şükrü Kendirci, Alp Paksoy, and Doruk Akgün. 2026. “Large Language Models’ Responses to Patient Questions on Lateral Epicondylitis: Multi-Institutional Orthopaedic Surgeon Evaluation”. Archives of Current Medical Research 7 (2): 321-30. https://doi.org/10.47482/acmr.1778992.

EndNote

Geçer A, Kaya E, Kendirci AŞ, Paksoy A, Akgün D (June 1, 2026) Large Language Models’ Responses to Patient Questions on Lateral Epicondylitis: Multi-Institutional Orthopaedic Surgeon Evaluation. Archives of Current Medical Research 7 2 321–330.

IEEE

[1] A. Geçer, E. Kaya, A. Ş. Kendirci, A. Paksoy, and D. Akgün, “Large Language Models’ Responses to Patient Questions on Lateral Epicondylitis: Multi-Institutional Orthopaedic Surgeon Evaluation”, Arch Curr Med Res, vol. 7, no. 2, pp. 321–330, Jun. 2026, doi: 10.47482/acmr.1778992.

ISNAD

Geçer, Ali - Kaya, Emre - Kendirci, Alper Şükrü - Paksoy, Alp - Akgün, Doruk. “Large Language Models’ Responses to Patient Questions on Lateral Epicondylitis: Multi-Institutional Orthopaedic Surgeon Evaluation”. Archives of Current Medical Research 7/2 (June 1, 2026): 321-330. https://doi.org/10.47482/acmr.1778992.

JAMA

1. Geçer A, Kaya E, Kendirci AŞ, Paksoy A, Akgün D. Large Language Models’ Responses to Patient Questions on Lateral Epicondylitis: Multi-Institutional Orthopaedic Surgeon Evaluation. Arch Curr Med Res. 2026;7:321–330.

MLA

Geçer, Ali, et al. “Large Language Models’ Responses to Patient Questions on Lateral Epicondylitis: Multi-Institutional Orthopaedic Surgeon Evaluation”. Archives of Current Medical Research, vol. 7, no. 2, June 2026, pp. 321-30, doi:10.47482/acmr.1778992.

Vancouver

1. Ali Geçer, Emre Kaya, Alper Şükrü Kendirci, Alp Paksoy, Doruk Akgün. Large Language Models’ Responses to Patient Questions on Lateral Epicondylitis: Multi-Institutional Orthopaedic Surgeon Evaluation. Arch Curr Med Res. 01 June 2026;7(2):321-30. doi:10.47482/acmr.1778992

Archives of Current Medical Research (ACMR) provides immediate open access to all of its content, on the principle that making research freely available to the public supports a greater global exchange of knowledge.

http://www.acmronline.org/
