Research Article

Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications

Volume: 29, Issue: 1, March 3, 2026

Abstract

The growing use of Large Language Models (LLMs) in healthcare raises important questions about the need for domain-specific training in medical applications. This study presents a detailed evaluation of medical-domain and general-purpose LLMs using three medical datasets (PubMedQA, BioASQ, and WikiDoc), which contain approximately 11,000 question-answer pairs. We evaluated four medical-domain models (Meditron-7B, BioMistral-7B, MedAlpaca-13B, and PMC-LLaMA-13B) against four general-purpose instruction-tuned models (Ministral-8B-Instruct, Gemma 2-9B-it, Vicuna-13B v1.5, and Llama 3-8B-Instruct). Across 182,944 prompts in both zero-shot and few-shot settings, our findings show that general-purpose models consistently outperformed their medical-specific counterparts on all evaluation metrics. Specifically, Ministral-8B-Instruct achieved the highest performance in few-shot settings with a BERTScore of 0.613, SimCSE of 0.764, and semantic similarity of 0.684. These scores were significantly higher than those of the best medical model, BioMistral-7B (0.545, 0.678, and 0.533, respectively). Furthermore, zero-shot performance often matched or surpassed few-shot results, as seen with Llama-3-8B-Instruct achieving a SimCSE score of 0.794. These findings challenge the common assumption that domain-specific pretraining is required for optimal performance in specialized tasks and have major implications for how resources are allocated in healthcare AI development.
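The similarity metrics named in the abstract (BERTScore, SimCSE, semantic similarity) all reduce to comparing embedding vectors of a model's answer against a reference answer, typically via cosine similarity. The sketch below shows only that cosine-similarity core, with toy four-dimensional vectors standing in for real sentence embeddings; the study itself would obtain embeddings from the actual BERTScore and SimCSE encoders, which are not reproduced here.

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity: dot(u, v) / (||u|| * ||v||).
    # Embedding-based metrics score a candidate answer higher
    # the closer its vector points to the reference vector.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for real sentence embeddings
# (hypothetical values, for illustration only).
reference = [0.9, 0.1, 0.3, 0.5]
candidate = [0.8, 0.2, 0.4, 0.4]
score = cosine_similarity(reference, candidate)
```

Identical vectors score 1.0 and orthogonal ones 0.0, which is why the reported scores (e.g., SimCSE 0.764 vs. 0.678) can be compared directly across models.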

Keywords


Details

Primary Language

English

Subjects

Natural Language Processing, Artificial Intelligence (Other)

Section

Research Article

Early View Date

January 2, 2026

Publication Date

March 3, 2026

Submission Date

June 13, 2025

Acceptance Date

December 4, 2025

Published Issue

Year 2026, Volume: 29, Issue: 1

How to Cite

APA
Roxas, D. Q., & Emekci, H. (2026). Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications. Politeknik Dergisi, 29(1), 1-7. https://doi.org/10.2339/politeknik.1719005
AMA
1.Roxas DQ, Emekci H. Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications. Politeknik Dergisi. 2026;29(1):1-7. doi:10.2339/politeknik.1719005
Chicago
Roxas, Daniel Quillan, and Hakan Emekci. 2026. “Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications”. Politeknik Dergisi 29 (1): 1-7. https://doi.org/10.2339/politeknik.1719005.
EndNote
Roxas DQ, Emekci H (March 01, 2026) Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications. Politeknik Dergisi 29 1 1–7.
IEEE
[1] D. Q. Roxas and H. Emekci, “Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications”, Politeknik Dergisi, vol. 29, no. 1, pp. 1–7, Mar. 2026, doi: 10.2339/politeknik.1719005.
ISNAD
Roxas, Daniel Quillan - Emekci, Hakan. “Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications”. Politeknik Dergisi 29/1 (March 01, 2026): 1-7. https://doi.org/10.2339/politeknik.1719005.
JAMA
1.Roxas DQ, Emekci H. Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications. Politeknik Dergisi. 2026;29:1–7.
MLA
Roxas, Daniel Quillan, and Hakan Emekci. “Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications”. Politeknik Dergisi, vol. 29, no. 1, March 2026, pp. 1-7, doi:10.2339/politeknik.1719005.
Vancouver
1. Daniel Quillan Roxas, Hakan Emekci. Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications. Politeknik Dergisi. March 01, 2026;29(1):1-7. doi:10.2339/politeknik.1719005
 

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license.