Retrieval-Augmented Generation (RAG) systems integrate large language models with information retrieval to ground responses in factual data. This study systematically evaluates the contribution of each component of a medical question answering RAG system through comprehensive ablation analysis. We designed a hierarchical RAG architecture whose key components include hierarchical intent classification, query rewriting, two-stage retrieval (dense retrieval with FAISS followed by cross-encoder reranking with Clinical-Longformer), and specialist routing. We conducted systematic ablation studies across seven configurations on 476 medical questions from the MedQA benchmark. Each configuration was evaluated independently using GPT-4o mini as an LLM judge on four metrics: context relevance, completeness, faithfulness, and correctness (1-5 Likert scale), with each metric assessed in a separate evaluation call to minimize inter-metric bias. Statistical significance was validated through paired t-tests with effect size calculations (Cohen's d). The full system achieved an overall score of 3.64/5.0. Systematic ablation revealed two critical components: reranking (removal: -0.24 overall, p < 0.001, d = -0.44) and specialist routing (removal: -0.17 overall, p < 0.001, d = -0.29), both showing small but statistically significant effect sizes. Surprisingly, hierarchical intent classification degraded performance when included (+0.09 when removed; p = 0.010 for completeness), suggesting that simpler query processing may be preferable. Query rewriting showed minimal impact (-0.04 overall), while raw query inclusion significantly affected completeness (-0.15, p < 0.001). Reranking and specialist components are therefore essential for medical RAG systems, with statistical significance confirmed across 476 queries. The counterintuitive finding that hierarchical intent classification degrades performance (p < 0.05) suggests that architectural complexity does not always improve RAG system quality.
These results provide evidence-based guidance for designing medical question answering systems, showing that reranking infrastructure and domain expertise are more critical than sophisticated query understanding techniques.
Ethics committee approval was not required for this study because it did not involve human or animal subjects.
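The per-component deltas above rest on paired t-tests with Cohen's d computed over per-query score differences. As a hedged illustration (not the authors' exact analysis code), the statistical test for one ablation could be sketched as follows, where `full_scores` and `ablated_scores` are hypothetical per-query judge scores for the full system and an ablated configuration:

```python
import numpy as np
from scipy import stats

def paired_ttest_with_cohens_d(full_scores, ablated_scores):
    """Paired t-test plus Cohen's d for an ablation comparison.

    d is computed on the per-query score differences (ablated - full),
    a common convention for paired designs; a negative d means the
    ablation hurt performance relative to the full system.
    """
    full = np.asarray(full_scores, dtype=float)
    ablated = np.asarray(ablated_scores, dtype=float)
    diff = ablated - full
    t_stat, p_value = stats.ttest_rel(ablated, full)  # paired t-test
    d = diff.mean() / diff.std(ddof=1)                # Cohen's d (paired)
    return t_stat, p_value, d

# Illustrative only: simulated scores for 476 queries, not study data.
rng = np.random.default_rng(0)
full = rng.normal(3.64, 0.5, size=476)
ablated = full - 0.24 + rng.normal(0, 0.3, size=476)
t, p, d = paired_ttest_with_cohens_d(full, ablated)
```

With 476 paired observations, even a modest mean difference of about 0.24 points yields a very small p-value, which is consistent with the abstract's report of significance despite small effect sizes.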
| Primary Language | English |
|---|---|
| Subjects | Information Modelling, Management and Ontologies |
| Journal Section | Research Article |
| Authors | |
| Project Number | - |
| Submission Date | December 28, 2025 |
| Acceptance Date | January 28, 2026 |
| Publication Date | March 15, 2026 |
| DOI | https://doi.org/10.34248/bsengineering.1849342 |
| IZ | https://izlik.org/JA59DL84TU |
| Published in Issue | Year 2026 Volume: 9 Issue: 2 |