Research Article

Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications

Volume: 29 Number: 1 March 3, 2026

Abstract

The growing use of Large Language Models (LLMs) in healthcare raises important questions about whether medical applications require domain-specific training. This study presents a detailed evaluation of medical-domain and general-purpose LLMs using three medical datasets (PubMedQA, BioASQ, and WikiDoc), which contain approximately 11,000 question-answer pairs. We evaluated four medical-domain models (Meditron-7B, BioMistral-7B, MedAlpaca-13B, and PMC-LLaMA-13B) against four general-purpose instruction-tuned models (Ministral-8B-Instruct, Gemma 2-9B-it, Vicuna-13B v1.5, and Llama 3-8B-Instruct). Across 182,944 prompts in both zero-shot and few-shot settings, general-purpose models consistently outperformed their medical-specific counterparts on all evaluation metrics. Specifically, Ministral-8B-Instruct achieved the highest performance in few-shot settings, with a BERTScore of 0.613, a SimCSE score of 0.764, and a semantic similarity of 0.684. These scores were significantly higher than those of the best medical model, BioMistral-7B (0.545, 0.678, and 0.533, respectively). Furthermore, zero-shot performance often matched or surpassed few-shot results; for example, Llama 3-8B-Instruct achieved a zero-shot SimCSE score of 0.794. These findings challenge the common assumption that domain-specific pretraining is required for optimal performance in specialized tasks, and they have major implications for how resources are allocated in healthcare AI development.
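To make the reported few-shot gap concrete, the short sketch below computes the absolute and relative differences between the Ministral-8B-Instruct and BioMistral-7B scores quoted in the abstract. The six scores come from the abstract; the function name `score_gaps` and the choice to express the relative gap as a percentage of the medical model's score are illustrative assumptions, not part of the study's method.

```python
# Few-shot scores as reported in the abstract (metric -> score).
ministral = {"BERTScore": 0.613, "SimCSE": 0.764, "semantic_similarity": 0.684}
biomistral = {"BERTScore": 0.545, "SimCSE": 0.678, "semantic_similarity": 0.533}


def score_gaps(general, medical):
    """Return {metric: (absolute_gap, relative_gap_percent)}.

    Hypothetical helper: the relative gap is measured against the
    medical model's score, which is an assumption made for illustration.
    """
    gaps = {}
    for metric, general_score in general.items():
        medical_score = medical[metric]
        diff = general_score - medical_score
        gaps[metric] = (round(diff, 3), round(100 * diff / medical_score, 1))
    return gaps


gaps = score_gaps(ministral, biomistral)
for metric, (abs_gap, rel_gap) in gaps.items():
    print(f"{metric}: +{abs_gap:.3f} absolute, +{rel_gap:.1f}% relative")
```

On these figures the gap is largest for semantic similarity (0.151 absolute) and smallest for BERTScore (0.068 absolute).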

Details

Primary Language

English

Subjects

Natural Language Processing, Artificial Intelligence (Other)

Journal Section

Research Article

Early Pub Date

January 2, 2026

Publication Date

March 3, 2026

Submission Date

June 13, 2025

Acceptance Date

December 4, 2025

Published in Issue

Year 2026 Volume: 29 Number: 1

APA
Roxas, D. Q., & Emekci, H. (2026). Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications. Politeknik Dergisi, 29(1), 1-7. https://doi.org/10.2339/politeknik.1719005
AMA
1. Roxas DQ, Emekci H. Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications. Politeknik Dergisi. 2026;29(1):1-7. doi:10.2339/politeknik.1719005
Chicago
Roxas, Daniel Quillan, and Hakan Emekci. 2026. “Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications”. Politeknik Dergisi 29 (1): 1-7. https://doi.org/10.2339/politeknik.1719005.
EndNote
Roxas DQ, Emekci H (March 1, 2026) Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications. Politeknik Dergisi 29 1 1–7.
IEEE
[1] D. Q. Roxas and H. Emekci, “Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications”, Politeknik Dergisi, vol. 29, no. 1, pp. 1–7, Mar. 2026, doi: 10.2339/politeknik.1719005.
ISNAD
Roxas, Daniel Quillan - Emekci, Hakan. “Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications”. Politeknik Dergisi 29/1 (March 1, 2026): 1-7. https://doi.org/10.2339/politeknik.1719005.
JAMA
1. Roxas DQ, Emekci H. Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications. Politeknik Dergisi. 2026;29:1–7.
MLA
Roxas, Daniel Quillan, and Hakan Emekci. “Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications”. Politeknik Dergisi, vol. 29, no. 1, Mar. 2026, pp. 1-7, doi:10.2339/politeknik.1719005.
Vancouver
1. Daniel Quillan Roxas, Hakan Emekci. Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications. Politeknik Dergisi. 2026 Mar. 1;29(1):1-7. doi:10.2339/politeknik.1719005