Investigation of Low-Parameter Gemma 3 Models for Medical Reasoning Using CoT-Supported SFT and GRPO
Year 2025, Volume: 40 Issue: 3, 593 - 606, 26.09.2025
İsmail İşeri, Alper Yıldırım, Alihan Öztorun, Tuğba Tuna, Arda Turan
Abstract
This study aimed to improve and evaluate the complex reasoning capabilities of the Gemma 3 1B and Gemma 3 4B large language models in the medical domain. To this end, the effect of two training strategies, SFT (Supervised Fine-Tuning) and GRPO (Group Relative Policy Optimization), on both models was investigated. A multi-stage approach was followed: the base models were evaluated first, the Chain-of-Thought (CoT) format was then taught via SFT, and the reasoning process was finally refined with GRPO. Evaluations using GPT-4.1 as a judge model showed a marked improvement in model performance. SFT and GRPO training improved the model's ability to generate a logically consistent reasoning process, as evidenced by an increase in Reasoning Accuracy from 26% to 31%. This result indicates that the model learned how to reason rather than merely memorizing answers.
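As a rough illustration of the GRPO stage mentioned in the abstract, the sketch below shows the group-relative advantage computation that gives the method its name: several answers are sampled for the same question, each is scored, and every score is normalized against the mean and standard deviation of its own group. This is a minimal sketch of the general technique, not the authors' implementation; the function name, the example rewards, the use of the population standard deviation, and the `eps` stabilizer are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scores for several answers sampled for the SAME prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a choice made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one medical question
# (1.0 = judged correct, 0.0 = judged incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that score above their group's average receive a positive advantage and are reinforced, while below-average answers are discouraged. Because the baseline comes from the group itself, no separate value model is needed.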
Ethical Statement
This study does not include data collected through surveys, interviews, or observation and does not require ethics committee approval.
Supporting Institution
TÜBİTAK
Acknowledgments
This study was carried out within the scope of project number 5240094, supported by the Scientific and Technological Research Council of Türkiye (TÜBİTAK) under the TEYDEB 1505 University-Industry Collaboration Support Program. We thank TÜBİTAK for its contributions. The opinions, recommendations, and conclusions in this publication are solely the responsibility of the authors and do not reflect the views of TÜBİTAK.