Skip to main content

ADR-0024: RAG Full Mode Feature Flag

Date: 2026-02-10 | Status: Accepted

Context

The demo requires maximum answer quality. Cross-encoder reranking (Khattab & Zaharia, 2020) substantially improves retrieval precision, but the existing two-stage approach -- normal query followed by optional "Think Harder" escalation -- adds friction for demo audiences who should see the best output immediately. Meanwhile, the standard query uses conservative defaults (fewer candidates, no reranking, smaller model) optimized for cost efficiency rather than quality.

Decision

Introduce rag_full_mode=True (default) as a configuration flag that enables maximum quality settings for all queries:

ParameterStandard ModeFull Mode
Retrieval candidates3050 (configurable via rag_rerank_candidates)
BGE rerankingOffAlways-on (top-15)
LLM modelTier 2 (standard)Tier 3 (flagship)
Max tokens10001500 (configurable via rag_full_mode_max_tokens)
Temperature0.10.1 (configurable via rag_full_mode_temperature)
Think Harder buttonVisibleHidden

Implementation

In rag_service.py:

  • query_stream() checks rag_full_mode from config
  • When enabled, parameter overrides are applied before the retrieval and generation steps
  • Think Harder UI button is conditionally hidden via the /api/v1/config endpoint

Consequences

Positive

  • Best quality by default: Demo audiences see optimal results without manual escalation
  • Simplified UX: No "Think Harder" decision point for users
  • Single configuration toggle: Easy to switch between cost-optimized and quality-optimized modes
  • Reversible: Set rag_full_mode=False to restore standard mode for production cost management

Negative

  • Higher latency: Jina reranking adds ~500ms per query (BGE local fallback: ~1.5s)
  • Higher LLM cost per query: Tier 3 ($2.00/$8.00 per 1M tokens) vs Tier 2 ($0.40/$1.60) -- roughly 5x cost increase per query
  • More retrieval overhead: 20 candidates with reranking increases processing slightly

Neutral

  • Retrieval pipeline architecture unchanged (same stages, different parameters)
  • Safety filtering and medical advice guardrails unchanged
  • Knowledge graph augmentation unchanged

Alternatives Considered

Alternative 1: Always Use Maximum Settings (No Toggle)

Hardcode the high-quality settings without a feature flag.

  • Pros: Simpler code (no branching)
  • Cons: No path to cost optimization for production deployment
  • Why rejected: Need ability to reduce costs when scaling beyond demo

Alternative 2: Per-Query Quality Selector

Let users choose quality level (fast/balanced/best) per query.

  • Pros: Maximum flexibility
  • Cons: Confusing for end users ("why would I not want the best answer?"), adds UI complexity
  • Why rejected: Hospital search users should not need to understand quality tradeoffs
  • ADR-0008: User Feedback and Think Harder (original escalation design, now hidden in full mode)
  • ADR-0015: Taxonomy-Driven Normalization and LLM Optimization (model tier definitions)