Research Article

Assessing language model performance in biomedical question-answering: A case study using the LangChain framework on the CliCR dataset

Year 2025, Volume: 20, Issue: 72, 167-188, 12.12.2025

Abstract

This paper focuses on developing and implementing a biomedical question-answering (BQA) system that combines large language models (LLMs) with the CliCR dataset through the LangChain framework. The study evaluates how several models, including GPT-3.5, GPT-4, LLAMA3, and Mistral, handle clinical questions. Key methodologies include data preparation, prompt engineering, and model adaptation. The evaluation employs metrics such as precision, recall, F1-score, BLEU scores, and embedding-based metrics. Results show that supplying the entire case context significantly outperforms chunking and vector-store indexing methods. Notably, GPT-4 achieved an exact-match score of 44.7%, surpassing human experts. Although fine-tuning improves domain-specific performance, it carries a risk of overfitting. This research contributes to progress in BQA systems, with potential benefits for clinical decision-making and medical education.
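The exact-match and F1 scores named in the abstract can be illustrated with a minimal sketch, assuming the standard token-overlap definitions used in extractive QA benchmarks such as CliCR and SQuAD. The `normalize` helper and the example answers below are illustrative, not taken from the paper; full evaluation scripts typically also strip punctuation and articles before comparing.

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    # Lowercase and split on whitespace; real evaluation scripts also
    # strip punctuation and articles, omitted here for brevity.
    return text.lower().split()

def exact_match(prediction: str, gold: str) -> float:
    # 1.0 when the normalized answers are identical, else 0.0.
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    # Token-level overlap: precision = overlap / |prediction tokens|,
    # recall = overlap / |gold tokens|; F1 is their harmonic mean.
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("myasthenia gravis", "Myasthenia Gravis"))          # 1.0
print(round(token_f1("ocular myasthenia gravis", "myasthenia gravis"), 3))  # 0.8
```

Under these definitions, a corpus-level exact-match score such as the 44.7% reported for GPT-4 is simply the mean of the per-question exact-match values.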

References

  • Buckley, T. A., Diao, J. A., Rodman, A., & Manrai, A. K. (2023). Accuracy of a vision-language model on challenging medical cases. ArXiv, abs/2311.05591. https://arxiv.org/abs/2311.05591
  • Chowdhury, M., Lim, E., Higham, A., McKinnon, R., Ventoura, N., He, Y., & Pennington, N. D. (2023). Can large language models safely address patient questions following cataract surgery? Clinical NLP, 1, 131–137. https://aclanthology.org/2023.clinicalnlp-1.17
  • Duong, D., & Solomon, B. D. (2023). Analysis of large-language model versus human performance for genetics questions. medRxiv: The Preprint Server for Health Sciences. https://www.medrxiv.org/content/10.1101/2023.01.27.23285115v1
  • Gu, Y., Tinn, R., Cheng, H., Lucas, M. R., Usuyama, N., Liu, X., Naumann, T., Gao, J., & Poon, H. (2020). Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3, 1–23. https://arxiv.org/abs/2007.15779
  • Hiesinger, W., Zakka, C., Chaurasia, A., Shad, R., Dalal, A. R., Kim, J., Moor, M., Alexander, K., Ashley, E. A., Boyd, J., Boyd, K., Hirsch, K., Langlotz, C., & Nelson, J. (2023). Almanac: Retrieval-augmented language models for clinical medicine. Research Square. https://arxiv.org/abs/2303.01229
  • Holmes, J., Liu, Z., Zhang, L.-C., Ding, Y., Sio, T., McGee, L., Ashman, J., Li, X., Liu, T., Shen, J., & Liu, W. (2023). Evaluating large language models on a highly-specialized topic, radiation oncology physics. Frontiers in Oncology, 13. https://arxiv.org/abs/2304.01938
  • Holmes, J., Peng, R., Li, Y., Hu, J., Liu, Z., Wu, Z., Zhao, H., Jiang, X., Liu, W., Wei, H., Zou, J., Liu, T., & Shao, Y. (2023). Evaluating multiple large language models in pediatric ophthalmology. ArXiv, abs/2311.04368. https://arxiv.org/abs/2311.04368
  • Holmes, J., Ye, S., Li, Y., Wu, S.-N., Liu, Z., Wu, Z., Hu, J., Zhao, H., Jiang, X., Liu, W., Wei, H., Zou, J., Liu, T., & Shao, Y. (2023). Evaluating large language models in ophthalmology. ArXiv, abs/2311.04933. https://arxiv.org/abs/2311.04933
  • Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. de las, Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., & Sayed, W. E. (2023). Mistral 7B. ArXiv. https://arxiv.org/abs/2310.06825
  • Jin, Q., Yuan, Z., Xiong, G., Yu, Q., Ying, H., Tan, C., Chen, M., Huang, S., Liu, X., & Yu, S. (2021). Biomedical question answering: A survey of approaches and challenges. ACM Computing Surveys (CSUR), 55, 1–36. https://arxiv.org/abs/2102.05281
  • Kalyan, K. S., Rajasekharan, A., & Sangeetha, S. (2021). AMMU: A survey of transformer-based biomedical pretrained language models. ArXiv, abs/2105.00827. https://arxiv.org/abs/2105.00827
  • Koga, S. (2023). Exploring the pitfalls of large language models: Inconsistency and inaccuracy in answering pathology board examination-style questions. Pathology International, 73. https://onlinelibrary.wiley.com/doi/10.1111/pin.13382
  • Kumari, A., Kumari, A., Singh, A., Singh, S., Juhi, A., Dhanvijay, A., Pinjar, M., & Mondal, H. (2023). Large language models in hematology case solving: A comparative study of ChatGPT-3.5, Google Bard, and Microsoft Bing. Cureus, 15. https://onlinelibrary.wiley.com/doi/abs/10.1111/bjh.19738
  • Lee, D. T., Vaid, A., Menon, B. M., Freeman, R. R., Matteson, D. S., Marin, M. P., & Nadkarni, G. N. (2023). Development of a privacy preserving large language model for automated data extraction from thyroid cancer pathology reports. medRxiv. https://www.medrxiv.org/content/10.1101/2023.11.07.23298000v1
  • Li, J., Wang, S., Zhang, M., Li, W., Lai, Y., & Kang, X. (2024). Agent hospital: A simulacrum of hospital with evolvable medical agents. ArXiv. https://arxiv.org/abs/2405.02957
  • OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, S., & Altman, S. (2024). GPT-4 Technical Report. arXiv Preprint arXiv:2303.08774. https://arxiv.org/abs/2303.08774
  • Pal, R., Garg, H., Patel, S., & Sethi, T. (2023). Bias amplification in intersectional subpopulations for clinical phenotyping by large language models. medRxiv. https://www.medrxiv.org/content/10.1101/2023.03.22.23287585v1
  • Reese, J., Danis, D., Caulfield, J. H., Casiraghi, E., Valentini, G., Mungall, C., & Robinson, P. N. (2023). On the limitations of large language models in clinical diagnosis. medRxiv. https://www.medrxiv.org/content/10.1101/2023.07.13.23292613v1
  • Schubert, M., Wick, W., & Venkataramani, V. (2023). Performance of large language models on a neurology board–style examination. JAMA Network Open, 6. https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2812620
  • Suster, S., & Daelemans, W. (2018). CliCR: A dataset of clinical case reports for machine reading comprehension. ArXiv, abs/1803.09720. https://arxiv.org/abs/1803.09720
  • Tinn, R., Cheng, H., Gu, Y., Usuyama, N., Liu, X., Naumann, T., Gao, J., & Poon, H. (2021). Fine-tuning large neural language models for biomedical natural language processing. Patterns. https://arxiv.org/abs/2112.07869
  • Titus, A. J. (2023). NHANES-GPT: Large language models (LLMs) and the future of biostatistics. medRxiv. https://www.medrxiv.org/content/10.1101/2023.12.13.23299830v1
  • Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and efficient foundation language models. ArXiv. https://arxiv.org/abs/2302.13971
  • Üstün, A., Aryabumi, V., Yong, Z.-X., Ko, W.-Y., D’souza, D., Onilude, G., Bhandari, N., Singh, S., Ooi, H.-L., Kayid, A., Vargus, F., Blunsom, P., Longpre, S., Muennighoff, N., Fadaee, M., Kreutzer, J., & Hooker, S. (2024). Aya model: An instruction finetuned open-access multilingual language model. ArXiv. https://arxiv.org/abs/2402.07827
  • Wang, Z. (2022). Modern question answering datasets and benchmarks: A survey. ArXiv, abs/2206.15030. https://arxiv.org/abs/2206.15030
There are 25 references in total.

Details

Primary Language: English
Subjects: Natural Language Processing
Section: Research Article
Authors

Feras Almannaa 0009-0005-4645-1071

Ferdi Sönmez 0000-0002-5761-3867

Submission Date: August 8, 2025
Acceptance Date: November 18, 2025
Publication Date: December 12, 2025
Published Issue: Year 2025, Volume: 20, Issue: 72

Cite

APA: Almannaa, F., & Sönmez, F. (2025). Assessing language model performance in biomedical question-answering: A case study using the LangChain framework on the CliCR dataset. Anadolu Bil Meslek Yüksekokulu Dergisi, 20(72), 167–188.


All site content, except where otherwise noted, is licensed under a Creative Commons Attribution-NonCommercial License (CC BY-NC 4.0).
