ADR-0022: Dynamic Retrieval Future Consideration
Date: 2026-02-10 | Status: Deferred
Context
Dynamic retrieval techniques extend the standard RAG pipeline (Lewis et al., 2020). Approaches such as FLARE (Forward-Looking Active REtrieval) and DRAGIN (Dynamic Retrieval Augmented Generation based on Information Needs) retrieve additional context mid-generation when the model detects low-confidence tokens:
- Generate a partial response
- Detect uncertain tokens (low probability, hedging language)
- Formulate a targeted retrieval query based on the uncertain passage
- Retrieve additional context
- Continue generation with enriched context
This is particularly effective for long-form generation where the initial retrieval may not cover all sub-topics.
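The loop above can be sketched as follows. This is a minimal illustration that loosely follows the FLARE idea of using a tentative, low-confidence chunk as the retrieval query; it is not the pipeline's code. The names `generate_with_dynamic_retrieval`, `generate_chunk`, `retrieve`, `confidence_threshold`, and `max_chunks` are hypothetical, and the toy generator and retriever are stand-ins for a real model and index.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A generated token paired with the probability the model assigned to it.
Token = Tuple[str, float]


@dataclass
class DynamicRetrievalResult:
    text: str
    retrieval_calls: int


def generate_with_dynamic_retrieval(
    question: str,
    generate_chunk: Callable[[str, List[str], str], List[Token]],
    retrieve: Callable[[str], List[str]],
    confidence_threshold: float = 0.5,  # assumed value, tune per model
    max_chunks: int = 8,                # safety bound on the loop
) -> DynamicRetrievalResult:
    context = retrieve(question)  # upfront retrieval, as in standard RAG
    retrieval_calls = 1
    accepted = ""
    for _ in range(max_chunks):
        # Generate a tentative partial response.
        chunk = generate_chunk(question, context, accepted)
        if not chunk:
            break
        # Detect uncertain tokens by their probability.
        if min(prob for _, prob in chunk) < confidence_threshold:
            # Use the uncertain passage itself as the targeted query.
            query = " ".join(tok for tok, _ in chunk)
            context = context + retrieve(query)
            retrieval_calls += 1
            # Regenerate the chunk with the enriched context.
            chunk = generate_chunk(question, context, accepted)
            if not chunk:
                break
        accepted += " ".join(tok for tok, _ in chunk) + " "
    return DynamicRetrievalResult(accepted.strip(), retrieval_calls)


# Toy stand-ins so the sketch runs without a model or vector store.
def toy_retrieve(query: str) -> List[str]:
    return [f"doc about: {query}"]


def toy_generate(question: str, context: List[str], accepted: str) -> List[Token]:
    if not accepted:
        return [("Alpha", 0.9), ("is", 0.9), ("stable.", 0.9)]
    if "Beta" not in accepted:
        # Uncertain until a second retrieval has enriched the context.
        prob = 0.9 if len(context) > 1 else 0.2
        return [("Beta", prob), ("may", 0.9), ("vary.", 0.9)]
    return []


result = generate_with_dynamic_retrieval("toy question", toy_generate, toy_retrieve)
print(result.text, result.retrieval_calls)
```

In the toy run, the second chunk triggers one extra retrieval, so the total is the upfront call plus one mid-generation call.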
Decision
Defer implementation. Rationale:
- Short-form answers: The medical search chatbot produces short, focused answers (typically 2-5 sentences). Dynamic retrieval provides the most value for multi-paragraph generation where context needs shift mid-response.
- Streaming complexity: The pipeline uses WebSocket streaming for real-time token delivery. Dynamic retrieval requires pausing mid-stream, performing a retrieval round-trip, and resuming -- adding significant architectural complexity.
- Latency sensitivity: Each mid-generation retrieval adds 200-500ms (embedding + vector search + reranking). For short answers, this overhead exceeds the generation time itself.
- Upfront retrieval sufficiency: With 50-100 candidates, RRF fusion, and BGE reranking to top-15, the upfront retrieval captures sufficient context for short-form answers.
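For reference, the RRF step mentioned above can be sketched in a few lines. Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over the rankings it appears in. The function name `rrf_fuse`, the constant k = 60 (a commonly used default), and the two example rankings are assumptions for illustration; this ADR does not state which retrievers are fused, and the reranking stage is omitted.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def rrf_fuse(
    rankings: Sequence[Sequence[str]],
    k: int = 60,                  # assumed smoothing constant
    top_n: Optional[int] = None,  # e.g. the candidate count passed to a reranker
) -> List[str]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by ID for determinism.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered if top_n is None else ordered[:top_n]


# Hypothetical rankings from two retrievers over the same corpus.
ranking_one = ["a", "b", "c"]
ranking_two = ["b", "c", "d"]
print(rrf_fuse([ranking_one, ranking_two], top_n=3))
```

Documents that appear in both lists ("b" and "c" here) outrank those that appear in only one, which is the property that makes RRF useful for merging heterogeneous retrievers without calibrating their raw scores.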
Consequences
- Revisit if expanding to multi-step reasoning, long-form generation, or report-style outputs
- Monitor FLARE/DRAGIN research for latency-optimized variants
- Current architecture supports adding retrieval hooks at the service layer if needed later
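One possible shape for such a hook is sketched below, purely as an illustration. `RetrievalHook`, `NoOpHook`, `AnswerService`, and `extend_context` are hypothetical names and do not describe the existing service layer. The idea is that the default hook does nothing, so current behavior is unchanged until a dynamic-retrieval hook is plugged in.

```python
from typing import List, Optional, Protocol


class RetrievalHook(Protocol):
    """Hypothetical extension point called between generation steps."""

    def maybe_retrieve(self, question: str, partial_answer: str) -> Optional[List[str]]:
        """Return extra context passages, or None to skip retrieval."""
        ...


class NoOpHook:
    """Default hook: never retrieves, preserving today's single-pass flow."""

    def maybe_retrieve(self, question: str, partial_answer: str) -> Optional[List[str]]:
        return None


class AnswerService:
    """Hypothetical service-layer wrapper that owns the hook."""

    def __init__(self, hook: Optional[RetrievalHook] = None) -> None:
        self._hook: RetrievalHook = hook if hook is not None else NoOpHook()

    def extend_context(
        self, question: str, partial_answer: str, context: List[str]
    ) -> List[str]:
        # Ask the hook for extra passages; keep the context unchanged if none.
        extra = self._hook.maybe_retrieve(question, partial_answer)
        return context if not extra else context + extra


service = AnswerService()
print(service.extend_context("toy question", "partial answer", ["doc-1"]))
```

Keeping the hook behind a small interface would let a later FLARE- or DRAGIN-style implementation be added without touching the streaming path for deployments that do not enable it.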