Research Article

Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset

Volume: 20, Issue: 72, 12 December 2025

Abstract

This paper develops and implements a biomedical question-answering (BQA) system that combines large language models (LLMs) with the LangChain framework and evaluates it on the CliCR dataset. The study assesses several models, including GPT-3.5, GPT-4, Llama 3, and Mistral, on clinical questions. Key methodologies include data preparation, prompt engineering, and model adaptation. The evaluation employs precision, recall, F1-score, BLEU, and embedding-based metrics. Results show that supplying the entire case context significantly outperforms chunking and vector store indexing methods. Notably, GPT-4 achieved an exact match score of 44.7%, surpassing human experts. Although fine-tuning improves domain-specific performance, it carries a risk of overfitting. This research contributes to progress in BQA systems, with potential benefits for clinical decision-making and medical education.
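
The abstract reports exact match, precision, recall, and F1-score without stating how they are computed. As an illustration only, the sketch below shows one common way such scores are obtained for short free-text answers, using token overlap after light normalization. The normalization rules (lowercasing, stripping punctuation and English articles) and the function names `normalize`, `exact_match`, and `token_prf` are assumptions made for this example and are not taken from the paper.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    These normalization rules are an assumption for illustration.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when both answers are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_prf(prediction: str, reference: str) -> tuple[float, float, float]:
    """Token-level precision, recall, and F1 between two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    print(exact_match("The aspirin.", "aspirin"))
    print(token_prf("acute renal failure", "renal failure"))
```

Under this scheme, a prediction that contains the reference answer plus extra words keeps full recall but loses precision, which is one reason exact match and F1 can diverge for the same model.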

Keywords

Details

Primary Language

English

Subjects

Natural Language Processing

Section

Research Article

Publication Date

12 December 2025

Submission Date

8 August 2025

Acceptance Date

18 November 2025

Published in Issue

Year 2025, Volume: 20, Issue: 72

Cite

APA
Almannaa, F., & Sönmez, F. (2025). Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset. Anadolu Bil Meslek Yüksekokulu Dergisi, 20(72), 167-188. https://izlik.org/JA33MT76KD
AMA
1. Almannaa F, Sönmez F. Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset. ABMYO Dergisi. 2025;20(72):167-188. https://izlik.org/JA33MT76KD
Chicago
Almannaa, Feras, and Ferdi Sönmez. 2025. “Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset”. Anadolu Bil Meslek Yüksekokulu Dergisi 20 (72): 167-88. https://izlik.org/JA33MT76KD.
EndNote
Almannaa F, Sönmez F (01 December 2025) Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset. Anadolu Bil Meslek Yüksekokulu Dergisi 20 72 167–188.
IEEE
[1] F. Almannaa and F. Sönmez, “Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset”, ABMYO Dergisi, vol. 20, no. 72, pp. 167–188, Dec. 2025, [Online]. Available: https://izlik.org/JA33MT76KD
ISNAD
Almannaa, Feras - Sönmez, Ferdi. “Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset”. Anadolu Bil Meslek Yüksekokulu Dergisi 20/72 (01 December 2025): 167-188. https://izlik.org/JA33MT76KD.
JAMA
1. Almannaa F, Sönmez F. Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset. ABMYO Dergisi. 2025;20:167–188.
MLA
Almannaa, Feras, and Ferdi Sönmez. “Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset”. Anadolu Bil Meslek Yüksekokulu Dergisi, vol. 20, no. 72, December 2025, pp. 167-88, https://izlik.org/JA33MT76KD.
Vancouver
1. Feras Almannaa, Ferdi Sönmez. Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset. ABMYO Dergisi [Internet]. 01 December 2025;20(72):167-88. Available from: https://izlik.org/JA33MT76KD


All site content, except where otherwise noted, is licensed under a Creative Commons Attribution-NonCommercial Licence (CC BY-NC 4.0).
