Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications

Daniel Quillan Roxas; Hakan Emekci

doi:10.2339/politeknik.1719005

Abstract

The growing use of Large Language Models (LLMs) in healthcare raises important questions about whether domain-specific training is necessary for medical applications. This study presents a detailed evaluation of medical-domain and general-purpose LLMs using three medical datasets (PubMedQA, BioASQ, and WikiDoc) that together contain approximately 11,000 question-answer pairs. We evaluated four medical-domain models (Meditron-7B, BioMistral-7B, MedAlpaca-13B, and PMC-LLaMA-13B) against four general-purpose instruction-tuned models (Ministral-8B-Instruct, Gemma 2-9B-it, Vicuna-13B v1.5, and Llama-3-8B-Instruct) on 182,944 prompts in both zero-shot and few-shot settings. General-purpose models consistently outperformed their medical-domain counterparts on all evaluation metrics. Specifically, Ministral-8B-Instruct achieved the highest overall performance in few-shot settings, with a BERTScore of 0.613, a SimCSE score of 0.764, and a semantic similarity of 0.684, substantially exceeding the best medical model, BioMistral-7B (0.545, 0.678, and 0.533, respectively). Furthermore, zero-shot performance often matched or surpassed few-shot results; for example, Llama-3-8B-Instruct achieved a SimCSE score of 0.794 in zero-shot settings. These findings challenge the common assumption that domain-specific pretraining is required for optimal performance on specialized medical tasks, and they have important implications for resource allocation in healthcare AI development.
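To make the reported few-shot comparison concrete, the short sketch below computes the absolute and relative gaps between Ministral-8B-Instruct and BioMistral-7B from the six scores quoted in the abstract. It is only an illustration of the arithmetic: the names `FEW_SHOT_SCORES`, `score_gaps`, and `GAPS` are hypothetical, and the authors' own evaluation code is not shown here.

```python
# Few-shot scores as quoted in the abstract; nothing else is assumed.
FEW_SHOT_SCORES = {
    "Ministral-8B-Instruct": {
        "BERTScore": 0.613,
        "SimCSE": 0.764,
        "semantic similarity": 0.684,
    },
    "BioMistral-7B": {
        "BERTScore": 0.545,
        "SimCSE": 0.678,
        "semantic similarity": 0.533,
    },
}


def score_gaps(general, medical):
    """Return {metric: (absolute gap, relative gap in percent)}.

    The relative gap is measured against the medical model's score.
    Hypothetical helper, written only for this illustration.
    """
    gaps = {}
    for metric, general_score in general.items():
        medical_score = medical[metric]
        diff = general_score - medical_score
        gaps[metric] = (round(diff, 3), round(100 * diff / medical_score, 1))
    return gaps


GAPS = score_gaps(
    FEW_SHOT_SCORES["Ministral-8B-Instruct"],
    FEW_SHOT_SCORES["BioMistral-7B"],
)

for metric, (abs_gap, rel_gap) in GAPS.items():
    print(f"{metric}: +{abs_gap:.3f} absolute, +{rel_gap:.1f}% relative")
```

On these figures the gap is smallest for BERTScore (0.068) and largest for semantic similarity (0.151), which suggests that the size of the advantage depends on the metric used.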

Details

Primary Language

English

Subjects

Natural Language Processing, Artificial Intelligence (Other)

Journal Section

Research Article

Authors

Daniel Quillan Roxas *
0009-0000-4484-6751
Türkiye

Hakan Emekci
0000-0002-4074-5600
Türkiye

Early Pub Date

January 2, 2026

Publication Date

March 3, 2026

Submission Date

June 13, 2025

Acceptance Date

December 4, 2025

Published in Issue

Year 2026 Volume: 29 Number: 1

DOI

https://doi.org/10.2339/politeknik.1719005

IZ

https://izlik.org/JA64GR87GT

Cite

APA

Roxas, D. Q., & Emekci, H. (2026). Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications. Politeknik Dergisi, 29(1), 1-7. https://doi.org/10.2339/politeknik.1719005

AMA

1. Roxas DQ, Emekci H. Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications. Politeknik Dergisi. 2026;29(1):1-7. doi:10.2339/politeknik.1719005

Chicago

Roxas, Daniel Quillan, and Hakan Emekci. 2026. “Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications”. Politeknik Dergisi 29 (1): 1-7. https://doi.org/10.2339/politeknik.1719005.

EndNote

Roxas DQ, Emekci H (March 1, 2026) Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications. Politeknik Dergisi 29 1 1–7.

IEEE

[1] D. Q. Roxas and H. Emekci, “Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications”, Politeknik Dergisi, vol. 29, no. 1, pp. 1–7, Mar. 2026, doi: 10.2339/politeknik.1719005.

ISNAD

Roxas, Daniel Quillan - Emekci, Hakan. “Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications”. Politeknik Dergisi 29/1 (March 1, 2026): 1-7. https://doi.org/10.2339/politeknik.1719005.

JAMA

1. Roxas DQ, Emekci H. Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications. Politeknik Dergisi. 2026;29:1–7.

MLA

Roxas, Daniel Quillan, and Hakan Emekci. “Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications”. Politeknik Dergisi, vol. 29, no. 1, Mar. 2026, pp. 1-7, doi:10.2339/politeknik.1719005.

Vancouver

1. Daniel Quillan Roxas, Hakan Emekci. Comparative Analysis of Medical-Domain and General-Purpose Large Language Models: Evaluating Specialization in Healthcare Applications. Politeknik Dergisi. 2026 Mar. 1;29(1):1-7. doi:10.2339/politeknik.1719005