Research Article

Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset

Volume: 20 Number: 72 December 12, 2025

Abstract

This paper develops and evaluates a biomedical question-answering (BQA) system that combines large language models (LLMs) with the CliCR dataset and the LangChain framework. The study evaluates several models, including GPT-3.5, GPT-4, Llama 3, and Mistral, on clinical questions. The methodology covers data preparation, prompt engineering, and model adaptation. The evaluation employs metrics such as precision, recall, F1-score, BLEU, and embedding-based metrics. Results show that supplying the entire case context significantly outperforms chunking and vector store indexing. Notably, GPT-4 achieved an exact match score of 44.7%, surpassing human experts. Although fine-tuning improves domain-specific performance, it carries a risk of overfitting. This research contributes to the progress of BQA systems, with potential benefits for clinical decision-making and medical education.
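As an illustration of the answer-level metrics named in the abstract, the following is a minimal sketch of exact match and token-level precision, recall, and F1 for a single predicted answer. The normalization steps (lowercasing, stripping punctuation and English articles) and the function names `normalize`, `exact_match`, and `token_prf` are assumptions in the style of common extractive QA evaluation scripts; they are not taken from the paper, whose exact scoring procedure may differ.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumed normalization; the paper's own preprocessing is not specified here.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(reference)


def token_prf(prediction: str, reference: str) -> tuple[float, float, float]:
    """Token-overlap precision, recall, and F1 between prediction and reference."""
    pred, ref = normalize(prediction), normalize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    print(exact_match("The pneumonia.", "pneumonia"))
    print(token_prf("acute bacterial pneumonia", "bacterial pneumonia"))
```

Under these assumptions, a dataset-level score would be the mean of the per-question values; when a question has several acceptable reference answers, a common choice is to take the maximum score over the references.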

Details

Primary Language

English

Subjects

Natural Language Processing

Journal Section

Research Article

Publication Date

December 12, 2025

Submission Date

August 8, 2025

Acceptance Date

November 18, 2025

Published in Issue

Year 2025 Volume: 20 Number: 72

APA
Almannaa, F., & Sönmez, F. (2025). Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset. Anadolu Bil Meslek Yüksekokulu Dergisi, 20(72), 167-188. https://izlik.org/JA33MT76KD
AMA
1. Almannaa F, Sönmez F. Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset. ABMYO Dergisi. 2025;20(72):167-188. https://izlik.org/JA33MT76KD
Chicago
Almannaa, Feras, and Ferdi Sönmez. 2025. “Assessing Language Model Performance in Biomedical Question-Answering: A Case Study Using the Langchain Framework on the CliCR Dataset”. Anadolu Bil Meslek Yüksekokulu Dergisi 20 (72): 167-88. https://izlik.org/JA33MT76KD.
EndNote
Almannaa F, Sönmez F (December 1, 2025) Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset. Anadolu Bil Meslek Yüksekokulu Dergisi 20 72 167–188.
IEEE
[1] F. Almannaa and F. Sönmez, “Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset”, ABMYO Dergisi, vol. 20, no. 72, pp. 167–188, Dec. 2025, [Online]. Available: https://izlik.org/JA33MT76KD
ISNAD
Almannaa, Feras - Sönmez, Ferdi. “Assessing Language Model Performance in Biomedical Question-Answering: A Case Study Using the Langchain Framework on the CliCR Dataset”. Anadolu Bil Meslek Yüksekokulu Dergisi 20/72 (December 1, 2025): 167-188. https://izlik.org/JA33MT76KD.
JAMA
1. Almannaa F, Sönmez F. Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset. ABMYO Dergisi. 2025;20:167–188.
MLA
Almannaa, Feras, and Ferdi Sönmez. “Assessing Language Model Performance in Biomedical Question-Answering: A Case Study Using the Langchain Framework on the CliCR Dataset”. Anadolu Bil Meslek Yüksekokulu Dergisi, vol. 20, no. 72, Dec. 2025, pp. 167-88, https://izlik.org/JA33MT76KD.
Vancouver
1. Feras Almannaa, Ferdi Sönmez. Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset. ABMYO Dergisi [Internet]. 2025 Dec. 1;20(72):167-88. Available from: https://izlik.org/JA33MT76KD



All site content, except where otherwise noted, is licensed under a Creative Commons Attribution-NonCommercial Licence (CC BY-NC 4.0).