Dissecting Medical RAG: Why Reranking Matters More Than Complexity in Question Answering
Abstract
Retrieval-Augmented Generation (RAG) systems integrate large language models with information retrieval to ground responses in factual data. This study systematically evaluates the contribution of each RAG component in a medical question answering system through ablation analysis. We designed a hierarchical RAG architecture with six key components, including hierarchical intent classification, query rewriting, two-stage retrieval (dense retrieval with FAISS followed by cross-encoder reranking using Clinical-Longformer), and specialist routing. We ran ablation studies across seven configurations on 476 medical questions from MedQA benchmarks. Each configuration was evaluated independently using GPT-4o mini as an LLM judge on four metrics: context relevance, completeness, faithfulness, and correctness (1-5 Likert scale). Each metric was assessed through a separate evaluation call to minimize inter-metric bias. Statistical significance was validated with paired t-tests and effect size calculations (Cohen’s d). The full system achieved an overall score of 3.64/5.0. The ablation revealed two critical components: reranking (removal: -0.24 overall, p < 0.001, d = -0.44) and specialists (removal: -0.17 overall, p < 0.001, d = -0.29), both with small but statistically significant effect sizes. Surprisingly, including hierarchical intent classification degraded performance (+0.09 when removed, p = 0.010 for completeness), suggesting that simpler query processing may be preferable. Query rewriting had minimal impact (-0.04 overall), while removing the raw query significantly reduced completeness (-0.15, p < 0.001). Reranking and specialist components are therefore essential for medical RAG systems, with statistical significance confirmed across 476 queries. The counterintuitive finding that hierarchical intent classification degrades performance (p < 0.05) suggests that architectural complexity does not always improve RAG system quality.
These results provide evidence-based guidance for designing medical question answering systems, showing that reranking infrastructure and domain expertise are more critical than sophisticated query understanding techniques.
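The paired comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `paired_ablation_stats`, the example scores, and the choice of the paired-samples effect size (mean difference divided by the standard deviation of the differences) are assumptions, since the abstract does not state which Cohen's d variant was used. The p-value uses a normal approximation to the t distribution, which is reasonable only for large samples such as the 476 questions reported here.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_ablation_stats(full_scores, ablated_scores):
    """Compare per-question judge scores of the full system and one ablation.

    Hypothetical helper: returns the mean difference, the paired t statistic,
    Cohen's d for paired samples, and an approximate two-sided p-value.
    """
    if len(full_scores) != len(ablated_scores):
        raise ValueError("scores must be paired question by question")
    # Difference is ablated minus full, so a negative value means removal hurt.
    diffs = [a - f for f, a in zip(full_scores, ablated_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    cohens_d = mean_diff / sd_diff  # assumed variant: paired-samples d
    # Normal approximation to the t distribution with n - 1 degrees of freedom.
    p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return {"mean_diff": mean_diff, "t": t_stat, "d": cohens_d, "p_approx": p_approx}


# Made-up 1-5 Likert scores for six questions, for illustration only.
full = [4, 4, 3, 5, 4, 3]
without_reranker = [3, 4, 3, 4, 4, 2]
result = paired_ablation_stats(full, without_reranker)
print(result)
```

For small samples, an exact p-value from the t distribution (for example via a statistics library's paired t-test) would be more appropriate than the normal approximation used in this sketch.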
Keywords
Ethical Statement
Ethics committee approval was not required for this study because it did not involve human or animal subjects.
Details
Primary Language
English
Subjects
Information Modelling, Management and Ontologies
Journal Section
Research Article
Publication Date
March 15, 2026
Submission Date
December 28, 2025
Acceptance Date
January 28, 2026
Published in Issue
Year 2026 Volume: 9 Number: 2
APA
Emekci, H., & Roxas, D. Q. (2026). Dissecting Medical RAG: Why Reranking Matters More Than Complexity in Question Answering. Black Sea Journal of Engineering and Science, 9(2), 549-561. https://doi.org/10.34248/bsengineering.1849342