Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset
Abstract
This paper develops and implements a biomedical question-answering (BQA) system using large language models (LLMs) and the CliCR dataset in combination with the LangChain framework. The study evaluates several models, including GPT-3.5, GPT-4, LLAMA3, and Mistral, on clinical questions. Key methodologies include data preparation, prompt engineering, and model adaptation. The evaluation employs precision, recall, F1-score, BLEU, and embedding-based metrics. Results show that supplying the entire case context significantly outperforms chunking and vector store indexing methods. Notably, GPT-4 achieved an exact match score of 44.7%, surpassing human experts. Although fine-tuning improves domain-specific performance, it carries a risk of overfitting. This research contributes to progress in BQA systems, with possible benefits for clinical decision-making and medical education.
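As a rough illustration of the answer-level metrics named in the abstract, the sketch below computes exact match and token-overlap precision, recall, and F1 for a single predicted answer. The lowercase whitespace tokenization and the function names (`tokenize`, `exact_match`, `token_prf`) are assumptions made for this example; the abstract does not state how the paper normalizes answers or aggregates scores.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase + whitespace split. The paper's actual
    # normalization rules are not given in the abstract.
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the normalized token sequences are identical, else 0.0.
    return float(tokenize(prediction) == tokenize(reference))


def token_prf(prediction: str, reference: str) -> tuple[float, float, float]:
    # Token-overlap precision, recall, and F1 between one prediction
    # and one reference answer (multiset intersection of tokens).
    pred, ref = tokenize(prediction), tokenize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, a prediction of "acute renal failure" against the reference "renal failure" shares two tokens, so precision is 2/3, recall is 1, and F1 is 0.8, while exact match is 0.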
Keywords
Details
Primary Language: English
Subjects: Natural Language Processing
Journal Section: Research Article
Publication Date: December 12, 2025
Submission Date: August 8, 2025
Acceptance Date: November 18, 2025
Published in Issue: Year 2025 Volume: 20 Number: 72
APA
Almannaa, F., & Sönmez, F. (2025). Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset. Anadolu Bil Meslek Yüksekokulu Dergisi, 20(72), 167-188. https://izlik.org/JA33MT76KD