Research Article

AI vs AI: clinical reasoning performance of language models in orthopedic rehabilitation

Year 2025, Volume: 8 Issue: 5, 825-831, 16.09.2025
https://doi.org/10.32322/jhsm.1743257

Abstract

Aims: This study aimed to compare the clinical reasoning and treatment planning performance of three advanced large language models (LLMs), ChatGPT-4o, Gemini 2.5 Pro, and DeepSeek-V3, in orthopedic rehabilitation. Their responses to standardized clinical scenarios were evaluated to determine alignment with evidence-based physiotherapy practices, focusing on relevance, accuracy, completeness, applicability, and safety awareness.
Methods: Three fictional but clinically realistic scenarios involving rotator cuff tendinopathy, lumbar disc herniation with radiculopathy, and anterior cruciate ligament (ACL) reconstruction were developed by an experienced physiotherapist. Each scenario was submitted independently to the three AI models on the same day using identical prompts. A blinded expert physiotherapist evaluated each model’s detailed responses using a 5-point Likert scale across five domains: clinical accuracy, relevance, completeness, applicability, and safety awareness. Mean scores and descriptive statistics were calculated.
Results: DeepSeek-V3 was consistently rated highest (5/5) across all domains and scenarios, demonstrating comprehensive and clinically rigorous plans. ChatGPT-4o showed strong performance overall, with total scores ranging from 19 to 20 out of 25, though it exhibited lower completeness scores due to less specific milestones. Gemini 2.5 Pro scored lower overall (average total score 18/25), with particular weaknesses in applicability and clinical relevance in complex cases such as lumbar disc herniation. All models provided evidence-based treatment approaches emphasizing pain management, postural correction, gradual strengthening, and return-to-activity progression. Differences arose in emphasis on lifestyle modification, patient education depth, and integration of psychosocial factors, with Gemini uniquely addressing psychological readiness in ACL rehabilitation.
Conclusion: AI-generated rehabilitation plans show substantial concordance with current physiotherapy guidelines but vary in detail and clinical practicality. DeepSeek-V3 outperformed the other models in consistency and safety considerations, while ChatGPT-4o balanced clinical accuracy with moderate completeness. Gemini 2.5 Pro’s inclusion of biopsychosocial components offers valuable insights but may require further refinement for clinical applicability. These findings highlight the potential and current limitations of AI tools in orthopedic rehabilitation, suggesting careful model selection based on clinical context and user needs.
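
The scoring arithmetic described in the Methods (five domains, each rated 1 to 5, giving a total out of 25 per scenario) can be illustrated with a minimal Python sketch. The function names `total_score` and `summarize` are hypothetical and are not taken from the study. The only ratings used are those the abstract reports for DeepSeek-V3 (5/5 in every domain and scenario). The per-domain ratings of the other models are not given here, so they are not reproduced.

```python
from statistics import mean

# The five rating domains named in the abstract.
DOMAINS = (
    "clinical accuracy",
    "relevance",
    "completeness",
    "applicability",
    "safety awareness",
)
MAX_PER_DOMAIN = 5  # 5-point Likert scale


def total_score(ratings):
    """Sum the five 1-5 domain ratings into a total out of 25 (hypothetical helper)."""
    missing = set(DOMAINS) - set(ratings)
    if missing:
        raise ValueError(f"missing domains: {sorted(missing)}")
    for domain in DOMAINS:
        if not 1 <= ratings[domain] <= MAX_PER_DOMAIN:
            raise ValueError(f"{domain} rating must be between 1 and {MAX_PER_DOMAIN}")
    return sum(ratings[domain] for domain in DOMAINS)


def summarize(per_scenario):
    """Return per-scenario totals plus simple descriptive statistics (hypothetical helper)."""
    totals = {name: total_score(r) for name, r in per_scenario.items()}
    values = list(totals.values())
    return {
        "totals": totals,
        "mean_total": mean(values),
        "min_total": min(values),
        "max_total": max(values),
    }


# Ratings as reported in the abstract: DeepSeek-V3 received 5/5 in all domains and scenarios.
SCENARIOS = ("rotator cuff tendinopathy", "lumbar disc herniation", "ACL reconstruction")
deepseek_v3 = {scenario: {domain: 5 for domain in DOMAINS} for scenario in SCENARIOS}

summary = summarize(deepseek_v3)
print(summary["totals"])
print(summary["mean_total"])
```

With all-5 ratings, each scenario total is 25, which is the maximum possible. A total of 19 or 20, as reported for ChatGPT-4o, corresponds to a combined shortfall of 5 or 6 points across the five domains.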


Turkish title: Yapay Zeka vs Yapay Zeka: ortopedik rehabilitasyonda dil modellerinin klinik akıl yürütme performansı


Abstract

Aims:
This study aimed to compare the clinical reasoning and treatment planning performance of three advanced large language models (ChatGPT-4o, Gemini 2.5 Pro, and DeepSeek-V3) in orthopedic rehabilitation. Their responses to standardized clinical scenarios were analyzed for relevance, accuracy, completeness, applicability, and safety awareness to assess their alignment with evidence-based physiotherapy practice.

Methods:
Three fictional but clinically realistic scenarios involving rotator cuff tendinopathy, lumbar disc herniation, and anterior cruciate ligament (ACL) reconstruction were prepared by an experienced physiotherapist. The scenarios were answered independently by the three AI models on the same day using a standardized prompt. An expert physiotherapist rated each model’s responses on a 5-point Likert scale across five domains (clinical accuracy, relevance, completeness, applicability, and safety awareness). Mean scores and descriptive statistics were calculated.

Results:
DeepSeek-V3 received perfect scores (5/5) across all scenarios and criteria, presenting the most comprehensive and clinically consistent plans. ChatGPT-4o performed strongly overall (total scores ranging from 19 to 20 out of 25) but scored lower in completeness. Gemini 2.5 Pro had the lowest scores for applicability and clinical relevance (mean 18/25) and was particularly weak in the lumbar disc herniation scenario. All models offered evidence-based approaches such as pain management, postural correction, and gradual strengthening. Differences were seen mainly in lifestyle modification, patient education, and the integration of psychosocial components; Gemini addressed elements such as psychological readiness more prominently.

Conclusion:
Although AI-based rehabilitation plans show high concordance with current physiotherapy guidelines, they differ in detail and clinical applicability. DeepSeek-V3 was superior in consistency and safety precautions, while ChatGPT-4o balanced clinical accuracy with moderate completeness. Gemini 2.5 Pro’s biopsychosocial approaches offer clinical depth but may need refinement for practical use. These findings shed light on the potential and limitations of AI tools in orthopedic rehabilitation and indicate that model selection should be based on clinical context and user experience.


Details

Primary Language: English
Subjects: Physiotherapy
Section: Original Article
Authors

Ertuğrul Safran 0000-0002-6835-5428

Yusuf Yaşasın 0009-0004-7193-7252

Publication Date: 16 September 2025
Submission Date: 15 July 2025
Acceptance Date: 6 August 2025
Published in Issue: Year 2025 Volume: 8 Issue: 5

Cite

AMA: Safran E, Yaşasın Y. AI vs AI: clinical reasoning performance of language models in orthopedic rehabilitation. J Health Sci Med. September 2025;8(5):825-831. doi:10.32322/jhsm.1743257

Interuniversity Board (ÜAK) Equivalency: article published in journals indexed in Ulakbim TR Dizin [10 POINTS], and article published in journals in international indexes (1d), excluding 1a, b, c [5 POINTS]


Note:
Our journal is not indexed in WOS and is therefore not classified by quartile (Q).

The Council of Higher Education (YÖK) decisions on predatory/questionable journals, the author information notice, and the journal's fee policy can be downloaded from: https://dergipark.org.tr/tr/journal/2316/file/4905/show


Journal Indexes and Platforms

Indexes: ULAKBİM TR Dizin, Index Copernicus, ICI World of Journals, DOAJ, Directory of Research Journals Indexing (DRJI), General Impact Factor, ASOS Index, WorldCat (OCLC), MIAR, EuroPub, OpenAIRE, Türkiye Citation Index, Türk Medline Index, InfoBase Index, Scilit, etc.

Platforms: Google Scholar, CrossRef (DOI), ResearchBib, Open Access, COPE, ICMJE, NCBI, ORCID, Creative Commons, etc.