Retrieval-Augmented Generation (RAG) systems integrate large language models with information retrieval to ground responses in factual data. This study systematically evaluates the contribution of each RAG component in a medical question answering system through comprehensive ablation analysis. We designed a hierarchical RAG architecture with six key components, including hierarchical intent classification, query rewriting, two-stage retrieval (dense retrieval with FAISS plus cross-encoder reranking using Clinical-Longformer), and specialist routing. We conducted systematic ablation studies across seven configurations on 476 medical questions from the MedQA benchmark. Each configuration was evaluated independently using GPT-4o mini as an LLM judge across four metrics: context relevance, completeness, faithfulness, and correctness (1-5 Likert scale), with each metric assessed through a separate evaluation call to minimize inter-metric bias. Statistical significance was validated through paired t-tests with effect size calculations (Cohen's d). The full system achieved an overall score of 3.64/5.0. Systematic ablation revealed two critical components: reranking (removal: -0.24 overall, p < 0.001, d = -0.44) and specialists (removal: -0.17 overall, p < 0.001, d = -0.29), both showing small but statistically significant effects. Surprisingly, hierarchical intent classification degraded performance when included (+0.09 when removed, p = 0.010 for completeness), suggesting that simpler query processing may be preferable. Query rewriting showed minimal impact (-0.04 overall), while raw query inclusion significantly affected completeness (-0.15, p < 0.001). Reranking and specialist components are therefore essential for medical RAG systems, with statistical significance confirmed across 476 queries. The counterintuitive finding that hierarchical intent classification degrades performance (p < 0.05) suggests that architectural complexity does not always improve RAG system quality.
These results provide evidence-based guidance for designing medical question answering systems, showing that reranking infrastructure and domain expertise are more critical than sophisticated query understanding techniques.
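The two-stage retrieval described above (recall-oriented dense retrieval over the whole corpus, followed by a more expensive reranker applied only to the candidates) can be sketched in miniature. This is a toy illustration of the pipeline shape only: the keyword-overlap scorers below stand in for the paper's FAISS index and Clinical-Longformer cross-encoder, and the corpus and queries are invented.

```python
# Toy two-stage retrieval pipeline. Both scorers are deliberately simple
# stand-ins: a real system would use a FAISS vector index for stage 1 and
# a cross-encoder over (query, document) pairs for stage 2.

def dense_retrieve(query, corpus, k=3):
    """Stage 1: cheap, recall-oriented scoring over the whole corpus."""
    q_tokens = set(query.lower().split())
    def overlap(doc):
        return len(q_tokens & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def rerank(query, candidates, top_n=1):
    """Stage 2: a (pretend) costlier scorer applied only to the k candidates."""
    q_tokens = query.lower().split()
    def score(doc):
        return sum(doc.lower().count(w) for w in q_tokens)
    return sorted(candidates, key=score, reverse=True)[:top_n]

corpus = [
    "aspirin reduces fever and inflammation",
    "metformin is first-line therapy for type 2 diabetes",
    "insulin therapy for type 1 diabetes",
    "statins lower cholesterol",
]
query = "first-line therapy for type 2 diabetes"
hits = dense_retrieve(query, corpus)   # broad candidate set
best = rerank(query, hits)             # refined top result
```

The point of the split is cost: the stage-1 scorer must be cheap enough to run over every document, while the stage-2 reranker, which the ablation found most critical (-0.24 overall when removed), only sees the short candidate list.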
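The statistical validation named in the abstract (paired t-tests with Cohen's d on per-query score differences between the full system and an ablated configuration) can be reproduced in outline. The scores below are fabricated toy data for five queries, not the paper's results; in the study each of the 476 queries yields one such matched pair per ablation.

```python
import math

def paired_stats(full, ablated):
    """Paired t statistic and Cohen's d for two matched score lists.

    Mirrors the ablation setup: each query is scored once under the full
    system and once with one component removed, and the test is run on
    the per-query differences (ablated minus full).
    """
    assert len(full) == len(ablated) and len(full) > 1
    n = len(full)
    diffs = [a - f for f, a in zip(full, ablated)]
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    sd_d = math.sqrt(var_d)
    t_stat = mean_d / (sd_d / math.sqrt(n))  # paired t, df = n - 1
    cohens_d = mean_d / sd_d                 # effect size of the differences
    return t_stat, cohens_d

# Fabricated 1-5 Likert scores for five queries:
full_scores = [4.0, 3.5, 4.5, 3.0, 4.0]
no_rerank   = [3.5, 3.0, 4.0, 3.0, 3.5]
t, d = paired_stats(full_scores, no_rerank)  # t ≈ -4.0, d ≈ -1.79 here
```

A negative t and d indicate the ablated configuration scores lower, matching the sign convention of the reported removals (e.g. d = -0.44 for reranking).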
Artificial intelligence, Information retrieval, Reranking, Retrieval-augmented generation
Ethics committee approval was not required for this study because it did not involve animal or human subjects.
| Primary Language | English |
|---|---|
| Subjects | Information Modelling, Management and Ontologies |
| Section | Research Article |
| Authors | |
| Project Number | - |
| Submission Date | December 28, 2025 |
| Acceptance Date | January 28, 2026 |
| Publication Date | March 15, 2026 |
| DOI | https://doi.org/10.34248/bsengineering.1849342 |
| IZ | https://izlik.org/JA59DL84TU |
| Published in Issue | Year 2026, Volume: 9, Issue: 2 |