Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset

Feras Almannaa; Ferdi Sönmez


Abstract

This paper describes the development and implementation of a biomedical question-answering (BQA) system that uses large language models (LLMs) and the CliCR dataset in combination with the LangChain framework. The study evaluates how well several models, including GPT-3.5, GPT-4, LLAMA3, and Mistral, handle clinical questions. Key methodologies include data preparation, prompt engineering, and model adaptation. The evaluation uses precision, recall, F1-score, BLEU scores, and embedding-based metrics. Results show that supplying the entire case context significantly outperforms chunking and vector store indexing methods. Notably, GPT-4 achieved an exact match score of 44.7%, surpassing human experts. Although fine-tuning improves domain-specific performance, it carries a risk of overfitting. This research contributes to advances in BQA systems, with potential benefits for clinical decision-making and medical education.
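As an illustration of the answer-level metrics named in the abstract, the following is a minimal sketch of exact match and token-level precision, recall, and F1. It assumes a SQuAD-style normalization (lowercasing, removing punctuation and English articles), which the abstract does not specify; the function names `normalize`, `exact_match`, and `token_prf` are hypothetical and do not come from the paper.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; the paper may normalize differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, reference):
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(reference)


def token_prf(prediction, reference):
    """Token-overlap precision, recall, and F1 between two answers."""
    pred = normalize(prediction)
    ref = normalize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Under these assumptions, a predicted answer that contains the reference plus extra tokens gets full recall but reduced precision, which is why exact match is typically the stricter of the two measures.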

Details

Primary Language

English

Subjects

Natural Language Processing

Journal Section

Research Article

Authors

Feras Almannaa *
0009-0005-4645-1071
Türkiye

Ferdi Sönmez
0000-0002-5761-3867
Türkiye

Publication Date

December 12, 2025

Submission Date

August 8, 2025

Acceptance Date

November 18, 2025

Published in Issue

Year 2025 Volume: 20 Number: 72

IZ

https://izlik.org/JA33MT76KD

Cite

APA

Almannaa, F., & Sönmez, F. (2025). Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset. Anadolu Bil Meslek Yüksekokulu Dergisi, 20(72), 167-188. https://izlik.org/JA33MT76KD
