ADR-0024: RAG Full Mode Feature Flag
Date: 2026-02-10 | Status: Accepted
Context
The demo requires maximum answer quality. Cross-encoder reranking (Khattab & Zaharia, 2020) substantially improves retrieval precision, but the existing two-stage approach -- normal query followed by optional "Think Harder" escalation -- adds friction for demo audiences who should see the best output immediately. Meanwhile, the standard query uses conservative defaults (fewer candidates, no reranking, smaller model) optimized for cost efficiency rather than quality.
Decision
Introduce rag_full_mode=True (default) as a configuration flag that enables maximum quality settings for all queries:
| Parameter | Standard Mode | Full Mode |
|---|---|---|
| Retrieval candidates | 30 | 50 (configurable via rag_rerank_candidates) |
| BGE reranking | Off | Always-on (top-15) |
| LLM model | Tier 2 (standard) | Tier 3 (flagship) |
| Max tokens | 1000 | 1500 (configurable via rag_full_mode_max_tokens) |
| Temperature | 0.1 | 0.1 (configurable via rag_full_mode_temperature) |
| Think Harder button | Visible | Hidden |
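
The table above can be read as a set of overrides applied on top of the standard defaults. The following is a minimal sketch under stated assumptions: the QueryParams dataclass, the resolve_params helper, and the plain-dict config are hypothetical and not confirmed as part of rag_service.py. The setting names rag_full_mode, rag_rerank_candidates, rag_full_mode_max_tokens, and rag_full_mode_temperature come from this ADR.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class QueryParams:
    """Per-query settings; field names are illustrative, not the real schema."""
    candidates: int
    rerank_top_k: Optional[int]  # None means reranking is off
    model_tier: int
    max_tokens: int
    temperature: float


# Standard-mode defaults, taken from the table.
STANDARD = QueryParams(candidates=30, rerank_top_k=None, model_tier=2,
                       max_tokens=1000, temperature=0.1)


def resolve_params(config: dict) -> QueryParams:
    """Return the effective parameters for a config dict (hypothetical helper)."""
    if not config.get("rag_full_mode", True):  # the flag defaults to True
        return STANDARD
    return replace(
        STANDARD,
        candidates=config.get("rag_rerank_candidates", 50),
        rerank_top_k=15,
        model_tier=3,
        max_tokens=config.get("rag_full_mode_max_tokens", 1500),
        temperature=config.get("rag_full_mode_temperature", 0.1),
    )
```

Keeping the standard values in one immutable object means switching the flag off returns exactly the cost-optimized defaults, with no partial overrides left behind.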
Implementation
In rag_service.py:
- query_stream() checks rag_full_mode from config
- When enabled, parameter overrides are applied before the retrieval and generation steps
- Think Harder UI button is conditionally hidden via the /api/v1/config endpoint
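
The flow described above might look like the following sketch. The retrieve, rerank, and generate callables and the think_harder_visible key are hypothetical stand-ins: this ADR does not give the actual signature of query_stream() or the response shape of /api/v1/config.

```python
from typing import Callable, Iterator, List


def query_stream(
    question: str,
    config: dict,
    retrieve: Callable[[str, int], List[str]],
    rerank: Callable[[str, List[str], int], List[str]],
    generate: Callable[[str, List[str], int], Iterator[str]],
) -> Iterator[str]:
    """Apply full-mode overrides first, then run retrieval and generation."""
    full = config.get("rag_full_mode", True)
    # Overrides are resolved before either pipeline stage runs.
    candidates = config.get("rag_rerank_candidates", 50) if full else 30
    model_tier = 3 if full else 2

    docs = retrieve(question, candidates)
    if full:
        docs = rerank(question, docs, 15)  # keep the top 15 after reranking
    yield from generate(question, docs, model_tier)


def ui_config(config: dict) -> dict:
    """Payload a config endpoint could return so the UI can hide the button."""
    full = config.get("rag_full_mode", True)
    return {"rag_full_mode": full, "think_harder_visible": not full}
```

Passing the pipeline stages in as callables keeps the sketch self-contained; in the real service they would presumably be methods or module-level functions.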
Consequences
Positive
- Best quality by default: Demo audiences see optimal results without manual escalation
- Simplified UX: No "Think Harder" decision point for users
- Single configuration toggle: Easy to switch between cost-optimized and quality-optimized modes
- Reversible: Set rag_full_mode=False to restore standard mode for production cost management
Negative
- Higher latency: Jina reranking adds ~500ms per query (BGE local fallback: ~1.5s)
- Higher LLM cost per query: Tier 3 ($2.00/$8.00 per 1M tokens) vs Tier 2 ($0.40/$1.60) -- roughly 5x cost increase per query
- More retrieval overhead: 50 candidates (20 more than standard mode) with reranking increases processing slightly
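
To make the cost figure concrete, here is a small worked calculation. The prices are the ones quoted above, read as input/output price per 1M tokens (an assumption about the notation). The token counts are made-up illustrative values, not measurements from the system.

```python
# (input, output) price in USD per 1M tokens, as quoted above.
TIER_2 = (0.40, 1.60)
TIER_3 = (2.00, 8.00)


def query_cost(prices, input_tokens, output_tokens):
    """Cost in USD of one query at the given per-1M-token prices."""
    in_price, out_price = prices
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical query: 4,000 prompt tokens, answer length at the max-token cap.
standard = query_cost(TIER_2, 4_000, 1_000)
full = query_cost(TIER_3, 4_000, 1_500)
same_size = query_cost(TIER_3, 4_000, 1_000)

print(f"standard: ${standard:.4f}")
print(f"full:     ${full:.4f}")
print(f"ratio at equal token counts: {same_size / standard:.1f}x")
```

Both the input and the output price are 5x higher in Tier 3, so the ratio is 5x whenever the token counts match. Longer full-mode answers (up to 1500 tokens instead of 1000) push the per-query ratio above that.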
Neutral
- Retrieval pipeline architecture unchanged (same stages, different parameters)
- Safety filtering and medical advice guardrails unchanged
- Knowledge graph augmentation unchanged
Alternatives Considered
Alternative 1: Always Use Maximum Settings (No Toggle)
Hardcode the high-quality settings without a feature flag.
- Pros: Simpler code (no branching)
- Cons: No path to cost optimization for production deployment
- Why rejected: Need ability to reduce costs when scaling beyond demo
Alternative 2: Per-Query Quality Selector
Let users choose quality level (fast/balanced/best) per query.
- Pros: Maximum flexibility
- Cons: Confusing for end users ("why would I not want the best answer?"), adds UI complexity
- Why rejected: Hospital search users should not need to understand quality tradeoffs