ADR-0021: Self-RAG Future Consideration
Date: 2026-02-10 | Status: Deferred
Context
Self-RAG (Self-Reflective Retrieval-Augmented Generation) extends the standard RAG paradigm (Lewis et al., 2020) by letting the model decide, mid-generation, whether to:
- Retrieve additional context
- Evaluate retrieved passage relevance
- Critique its own output for faithfulness and completeness
The model generates special "reflection tokens" that gate the retrieval and self-assessment steps. Published results report improvements of up to 270%, measured against weak baselines.
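The control flow described above can be sketched as a bounded loop. This is a minimal illustration and not the Self-RAG reference implementation: the token names (`[Retrieve]`, `[Relevant]`, `[Supported]`), the `Step` type, and the `generate`, `retrieve`, and `judge` callables are all hypothetical stand-ins for a model that emits reflection tokens and for our retrieval stack.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical reflection-token names, chosen for illustration only.
RETRIEVE = "[Retrieve]"
RELEVANT = "[Relevant]"
SUPPORTED = "[Supported]"


@dataclass
class Step:
    """One generation pass: the draft text plus any reflection tokens emitted."""
    text: str
    tokens: List[str] = field(default_factory=list)


def self_rag_answer(
    query: str,
    generate: Callable[[str, List[str]], Step],
    retrieve: Callable[[str], List[str]],
    judge: Callable[[str, str], List[str]],
    max_rounds: int = 3,
) -> str:
    """Generate, and retrieve more context only when the model asks for it."""
    context: List[str] = []
    for _ in range(max_rounds):
        step = generate(query, context)
        if RETRIEVE not in step.tokens:
            return step.text  # the model did not request more context
        # Keep only passages that are judged relevant to the query.
        for passage in retrieve(query):
            if passage not in context and RELEVANT in judge(query, passage):
                context.append(passage)
    # Round budget exhausted: return a final draft instead of looping forever.
    return generate(query, context).text


# Toy stand-ins so the sketch runs end to end (assumptions, not real components).
def demo_generate(query: str, context: List[str]) -> Step:
    if not context:
        return Step("draft without evidence", [RETRIEVE])
    return Step(f"answer grounded in {len(context)} passage(s)", [SUPPORTED])


def demo_retrieve(query: str) -> List[str]:
    return ["visiting hours are 9-17", "cafeteria menu"]


def demo_judge(query: str, passage: str) -> List[str]:
    return [RELEVANT] if "visiting" in passage else []


if __name__ == "__main__":
    print(self_rag_answer("visiting hours?", demo_generate, demo_retrieve, demo_judge))
```

Each pass through the loop is a full generation call, so `max_rounds` bounds both the number of retrieval cycles and the worst-case number of generation passes (`max_rounds + 1`).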
Decision
Defer implementation. Rationale:
- Strong baseline: Our pipeline already includes hybrid search (vector + BM25), RRF fusion, BGE reranking, knowledge graph augmentation, and safety filtering. The +270% improvement is measured against naive RAG baselines, not production-grade pipelines.
- Latency cost: Self-RAG requires multiple generation passes (generate, evaluate, potentially re-retrieve, re-generate). For a hospital search tool where users expect sub-7-second responses, this adds unacceptable latency (estimated 2-4x generation time).
- Current bottleneck: Quality analysis shows the bottleneck is retrieval quality (getting the right chunks), not generation quality (producing good answers from retrieved context).
- Complexity: Requires fine-tuning or prompt engineering for reflection tokens, a custom generation loop, and careful evaluation to avoid infinite retrieval cycles.
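For reference, the RRF fusion step named in the first rationale item can be sketched as follows. This is a generic Reciprocal Rank Fusion example and not our pipeline's code: the function name `rrf_fuse`, the document ids, and the constant `k = 60` (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank is 1-based. k = 60 is an assumed default, not a tuned value.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]  # hypothetical vector-search ranking
    bm25_hits = ["b", "d", "a"]    # hypothetical BM25 ranking
    print(rrf_fuse([vector_hits, bm25_hits]))  # ['b', 'a', 'd', 'c']
```

Documents that rank well in both lists (`b`, `a`) rise above those found by only one retriever, which is the property that makes the hybrid baseline hard to beat with generation-side changes alone.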
Consequences
- Revisit when Phase 1-3 retrieval improvements plateau and generation quality becomes the limiting factor
- Monitor academic progress on latency-efficient Self-RAG variants
- The current "Think Harder" escalation flow provides a user-triggered approximation of self-reflection