Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset

Feras Almannaa; Ferdi Sönmez

Research Article

Year 2025, Volume: 20 Issue: 72, 167 - 188, 12.12.2025

https://izlik.org/JA33MT76KD

Abstract

This paper describes the development and implementation of a biomedical question-answering (BQA) system that combines large language models (LLMs) with the LangChain framework and the CliCR dataset. The study compares several models, including GPT-3.5, GPT-4, LLAMA3, and Mistral, on clinical questions. The methodology covers data preparation, prompt engineering, and model adaptation. Evaluation uses precision, recall, F1 score, BLEU, and embedding-based metrics. The results show that using the entire case context significantly outperforms chunking and vector-store indexing. Notably, GPT-4 achieved an exact-match score of 44.7%, surpassing human experts. Fine-tuning improves domain-specific performance but carries a risk of overfitting. The work contributes to progress in BQA systems, with potential benefits for clinical decision-making and medical education.
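The abstract names exact match, precision, recall, and F1 as evaluation metrics but does not say how they were computed. As a minimal sketch, assuming the common token-overlap definitions used in reading-comprehension evaluation (an assumption, not a detail confirmed by the paper), they could look like the following. The function names and the sample answers are hypothetical and chosen only for illustration.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple normalizer)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(reference)


def token_prf(prediction: str, reference: str) -> tuple[float, float, float]:
    """Token-overlap precision, recall, and F1 between two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up answers for illustration; not taken from CliCR or from the paper.
precision, recall, f1 = token_prf("acute renal failure", "renal failure")
print(exact_match("Renal failure", "renal failure"))
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In this sketch a prediction that contains the reference plus one extra word keeps full recall but loses precision, which is why exact match is typically the stricter of the reported scores.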

Keywords

fine-tuning, LLM, LangChain, GPT, Mistral, LLAMA, Prompt Engineering, RAG, Question Answering, CliCR, Evaluation, Cohere

Details

Primary Language	English
Subjects	Natural Language Processing
Section	Research Article
Authors	Feras Almannaa (ORCID 0009-0005-4645-1071); Ferdi Sönmez (ORCID 0000-0002-5761-3867)
Submission Date	August 8, 2025
Acceptance Date	November 18, 2025
Publication Date	December 12, 2025
IZ	https://izlik.org/JA33MT76KD
Published in Issue	Year 2025 Volume: 20 Issue: 72

Cite

APA	Almannaa, F., & Sönmez, F. (2025). Assessing language model performance in biomedical question-answering: A case study using the langchain framework on the CliCR Dataset. Anadolu Bil Meslek Yüksekokulu Dergisi, 20(72), 167-188. https://izlik.org/JA33MT76KD

All site content, except where otherwise noted, is licensed under a Creative Commons Attribution Licence (CC-BY-NC 4.0).