Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications
Abstract
The growing use of Large Language Models (LLMs) in healthcare raises the question of whether medical applications actually require domain-specific training. This study presents a detailed evaluation of medical-domain and general-purpose LLMs on three medical datasets (PubMedQA, BioASQ, and WikiDoc), which together contain approximately 11,000 question-answer pairs. We evaluated four medical-domain models (Meditron-7B, BioMistral-7B, MedAlpaca-13B, and PMC-LLaMA-13B) against four general-purpose instruction-tuned models (Ministral-8B-Instruct, Gemma 2-9B-it, Vicuna-13B v1.5, and Llama-3-8B-Instruct). Across 182,944 prompts in zero-shot and few-shot settings, the general-purpose models consistently outperformed their medical-domain counterparts on all evaluation metrics. Specifically, Ministral-8B-Instruct achieved the highest few-shot performance, with a BERTScore of 0.613, a SimCSE score of 0.764, and a semantic similarity of 0.684. These scores were significantly higher than those of the best medical model, BioMistral-7B (0.545, 0.678, and 0.533, respectively). Furthermore, zero-shot performance often matched or surpassed few-shot results; for example, Llama-3-8B-Instruct achieved a zero-shot SimCSE score of 0.794. These findings challenge the common assumption that domain-specific pretraining is required for optimal performance in specialized tasks, and they have major implications for how resources are allocated in healthcare AI development.
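The prompt count reported in the abstract can be related to the dataset size with a short calculation. The sketch below is illustrative only and rests on an assumption the abstract does not state: that every one of the eight models received every question exactly once in each of the two settings. The function name implied_pairs_per_run is hypothetical. Under that assumption, the total divides evenly into a per-run question count that is in line with the "approximately 11,000" figure.

```python
# Figures taken from the abstract.
TOTAL_PROMPTS = 182_944
MEDICAL_MODELS = 4
GENERAL_MODELS = 4
SETTINGS = ("zero-shot", "few-shot")


def implied_pairs_per_run(total_prompts: int, n_models: int, n_settings: int) -> tuple[int, int]:
    """Split the total prompt count evenly over (model, setting) runs.

    ASSUMPTION (not stated in the abstract): each model sees each
    question once per setting, so runs = n_models * n_settings.
    Returns (questions per run, leftover prompts).
    """
    runs = n_models * n_settings
    return divmod(total_prompts, runs)


pairs, remainder = implied_pairs_per_run(
    TOTAL_PROMPTS, MEDICAL_MODELS + GENERAL_MODELS, len(SETTINGS)
)
print(pairs, remainder)  # prints: 11434 0
```

A remainder of zero is consistent with the assumed design but does not confirm it; the per-dataset split is not given in the abstract.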
Details
Primary Language: English
Subjects: Natural Language Processing, Artificial Intelligence (Other)
Journal Section: Research Article
Early Pub Date: January 2, 2026
Publication Date: March 3, 2026
Submission Date: June 13, 2025
Acceptance Date: December 4, 2025
Published in Issue: Year 2026, Volume 29, Number 1