ADR-0021: Self-RAG Future Consideration

Date: 2026-02-10 | Status: Deferred

Context

Self-RAG (Self-Reflective Retrieval-Augmented Generation) extends the standard RAG paradigm (Lewis et al., 2020) by letting the model decide, mid-generation, whether to:

  1. Retrieve additional context
  2. Evaluate retrieved passage relevance
  3. Critique its own output for faithfulness and completeness

The model drives these steps by generating special "reflection tokens" that gate retrieval and self-assessment. Published results report improvements of up to 270% over weak baselines.
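The three decisions above can be pictured as a control loop. The following is a minimal sketch, not an implementation of Self-RAG itself: in the actual technique the decisions come from reflection tokens emitted by the model, whereas here they are stand-in callables (`should_retrieve`, `is_relevant`, `is_faithful`), and every name, including `max_rounds`, is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SelfRagResult:
    answer: str
    passages: List[str] = field(default_factory=list)
    retrieval_rounds: int = 0


def self_rag_answer(
    query: str,
    generate: Callable[[str, List[str]], str],
    should_retrieve: Callable[[str, str], bool],
    retrieve: Callable[[str], List[str]],
    is_relevant: Callable[[str, str], bool],
    is_faithful: Callable[[str, List[str]], bool],
    max_rounds: int = 2,
) -> SelfRagResult:
    """Hypothetical Self-RAG-style loop; callables stand in for reflection tokens."""
    passages: List[str] = []
    answer = generate(query, passages)  # initial draft without extra context
    rounds = 0
    # The hard cap on rounds guards against unbounded retrieval cycles.
    while rounds < max_rounds and should_retrieve(query, answer):
        rounds += 1
        # Step 1: retrieve additional context.
        candidates = retrieve(query)
        # Step 2: keep only new passages judged relevant to the query.
        passages.extend(
            p for p in candidates if is_relevant(query, p) and p not in passages
        )
        # Re-generate with the accepted passages.
        answer = generate(query, passages)
        # Step 3: stop as soon as the critique accepts the draft.
        if is_faithful(answer, passages):
            break
    return SelfRagResult(answer=answer, passages=passages, retrieval_rounds=rounds)
```

Each pass through the loop costs at least one extra generation call, and the cap on rounds is the simplest way to keep the loop from cycling indefinitely.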

Decision

Defer implementation. Rationale:

  1. Strong baseline: Our pipeline already includes hybrid search (vector + BM25), RRF fusion, BGE reranking, knowledge graph augmentation, and safety filtering. The +270% improvement is measured against naive RAG baselines, not production-grade pipelines.

  2. Latency cost: Self-RAG requires multiple generation passes (generate, evaluate, potentially re-retrieve, re-generate). In a hospital search tool where users expect sub-7-second responses, the added latency (an estimated 2-4x increase in generation time) is unacceptable.

  3. Current bottleneck: Quality analysis shows the bottleneck is retrieval quality (getting the right chunks), not generation quality (producing good answers from retrieved context).

  4. Complexity: Requires fine-tuning or prompt engineering for reflection tokens, a custom generation loop, and careful evaluation to avoid infinite retrieval cycles.
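The latency argument in item 2 comes down to simple arithmetic. The sketch below assumes a 1.0 s retrieval stage and a 3.5 s single-pass generation stage; both timings are hypothetical placeholders, not measurements from this pipeline, and only the 7-second budget and the 2-4x multiplier come from the rationale above.

```python
def estimated_latency_s(
    retrieval_s: float, generation_s: float, generation_multiplier: float = 1.0
) -> float:
    """Total response time when generation cost is scaled by a multiplier."""
    return retrieval_s + generation_s * generation_multiplier


def fits_budget(latency_s: float, budget_s: float = 7.0) -> bool:
    """True if the response stays under the sub-7-second expectation."""
    return latency_s < budget_s


# Hypothetical stage timings, for illustration only.
RETRIEVAL_S = 1.0
GENERATION_S = 3.5

for multiplier in (1.0, 2.0, 4.0):
    total = estimated_latency_s(RETRIEVAL_S, GENERATION_S, multiplier)
    print(f"{multiplier:.0f}x generation: {total:.1f}s, within budget: {fits_budget(total)}")
```

Under these assumed timings the single-pass pipeline fits the budget, while both ends of the 2-4x range exceed it. With real measurements substituted, the same check shows how much headroom a multi-pass approach would need.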

Consequences

  • Revisit when Phase 1-3 retrieval improvements plateau and generation quality becomes the limiting factor
  • Monitor academic progress on latency-efficient Self-RAG variants
  • Current "Think Harder" escalation flow provides a user-triggered approximation of self-reflection

Related

  • ADR-0008: User Feedback and Think Harder (manual escalation as lightweight alternative)
  • ADR-0024: RAG Full Mode Feature Flag (always-on quality improvements)