ADR-0021: Self-RAG Future Consideration

Date: 2026-02-10 | Status: Deferred

Context

Self-RAG (Self-Reflective Retrieval-Augmented Generation) extends the standard RAG paradigm (Lewis et al., 2020) by letting the model decide, mid-generation, whether to:

Retrieve additional context
Evaluate retrieved passage relevance
Critique its own output for faithfulness and completeness

Published results report improvements of up to 270% over weak baselines. The model generates special "reflection tokens" that gate the retrieval and self-assessment steps.
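To make the control flow concrete, the following is a minimal sketch of a Self-RAG-style loop. It is an illustration under stated assumptions, not the published method or anything in our pipeline: real Self-RAG has the model emit reflection tokens itself, whereas here each reflection decision is modeled as a plain callable. All names (self_rag_answer, needs_retrieval, is_relevant, is_supported, max_rounds) and the toy corpus are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SelfRagResult:
    answer: str
    retrievals: int
    generations: int
    passages: List[str] = field(default_factory=list)


def self_rag_answer(
    query: str,
    needs_retrieval: Callable[[str, str], bool],
    retrieve: Callable[[str], List[str]],
    is_relevant: Callable[[str, str], bool],
    generate: Callable[[str, List[str]], str],
    is_supported: Callable[[str, List[str]], bool],
    max_rounds: int = 2,
) -> SelfRagResult:
    """Hypothetical Self-RAG-style loop; each callable stands in for a reflection token."""
    passages: List[str] = []
    answer = ""
    retrievals = 0
    generations = 0
    # The hard cap on rounds keeps the loop from cycling forever.
    for _ in range(max_rounds):
        # Reflection 1: does the model want additional context?
        if needs_retrieval(query, answer):
            retrievals += 1
            for passage in retrieve(query):
                # Reflection 2: keep only passages judged relevant.
                if is_relevant(query, passage) and passage not in passages:
                    passages.append(passage)
        answer = generate(query, passages)
        generations += 1
        # Reflection 3: critique the draft; stop once it is judged supported.
        if is_supported(answer, passages):
            break
    return SelfRagResult(answer, retrievals, generations, passages)


# Toy stand-ins (assumptions for illustration only, not real components).
corpus = ["Visiting hours are 10:00 to 20:00.", "The cafeteria closes at 18:00."]
result = self_rag_answer(
    "When are visiting hours?",
    needs_retrieval=lambda q, draft: draft == "",
    retrieve=lambda q: corpus,
    is_relevant=lambda q, p: "visiting" in p.lower(),
    generate=lambda q, ps: ps[0] if ps else "I don't know.",
    is_supported=lambda a, ps: a in ps,
)
print(result.answer, result.retrievals, result.generations)
```

Each round can add one retrieval call and always adds one generation call, so a draft that never passes the critique costs max_rounds generations.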

Decision

Defer implementation. Rationale:

Strong baseline: Our pipeline already includes hybrid search (vector + BM25), RRF fusion, BGE reranking, knowledge graph augmentation, and safety filtering. The +270% improvement is measured against naive RAG baselines, not production-grade pipelines.
Latency cost: Self-RAG requires multiple generation passes (generate, evaluate, potentially re-retrieve, re-generate). For a hospital search tool where users expect sub-7-second responses, this adds unacceptable latency (estimated 2-4x generation time).
Current bottleneck: Quality analysis shows the bottleneck is retrieval quality (getting the right chunks), not generation quality (producing good answers from retrieved context).
Complexity: Requires fine-tuning or prompt engineering for reflection tokens, a custom generation loop, and careful evaluation to avoid infinite retrieval cycles.
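The latency argument above can be checked with simple arithmetic. The sketch below applies the estimated 2-4x generation multiplier to a single-pass baseline and compares the total against the sub-7-second expectation. The baseline figures (1.5 s retrieval, 3.0 s generation) are placeholder assumptions, not measurements from our pipeline, and the names LatencyEstimate and fits_budget are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyEstimate:
    retrieval_s: float   # search, fusion, reranking, augmentation (assumed lump sum)
    generation_s: float  # one single-pass generation

    def total(self, generation_multiplier: float = 1.0) -> float:
        """End-to-end seconds when generation time is scaled by the multiplier."""
        return self.retrieval_s + self.generation_s * generation_multiplier


def fits_budget(
    estimate: LatencyEstimate,
    generation_multiplier: float,
    budget_s: float = 7.0,
) -> bool:
    """True when the scaled total stays strictly under the response budget."""
    return estimate.total(generation_multiplier) < budget_s


# Placeholder baseline: substitute measured values before drawing conclusions.
baseline = LatencyEstimate(retrieval_s=1.5, generation_s=3.0)
for multiplier in (1.0, 2.0, 4.0):
    print(multiplier, baseline.total(multiplier), fits_budget(baseline, multiplier))
```

With these placeholder inputs, only the single-pass case stays under the budget. Re-running the check with measured latencies would show how much headroom a future Self-RAG variant would need.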

Consequences

Revisit when Phase 1-3 retrieval improvements plateau and generation quality becomes the limiting factor
Monitor academic progress on latency-efficient Self-RAG variants
Current "Think Harder" escalation flow provides a user-triggered approximation of self-reflection

Related ADRs

ADR-0008: User Feedback and Think Harder (manual escalation as lightweight alternative)
ADR-0024: RAG Full Mode Feature Flag (always-on quality improvements)
