ADR-0024: RAG Full Mode Feature Flag

Date: 2026-02-10 | Status: Accepted

Context

The demo requires maximum answer quality. Cross-encoder reranking (Khattab & Zaharia, 2020) substantially improves retrieval precision, but the existing two-stage approach -- normal query followed by optional "Think Harder" escalation -- adds friction for demo audiences who should see the best output immediately. Meanwhile, the standard query uses conservative defaults (fewer candidates, no reranking, smaller model) optimized for cost efficiency rather than quality.

Decision

Introduce rag_full_mode=True (default) as a configuration flag that enables maximum quality settings for all queries:

Parameter	Standard Mode	Full Mode
Retrieval candidates	30	50 (configurable via `rag_rerank_candidates`)
BGE reranking	Off	Always-on (top-15)
LLM model	Tier 2 (standard)	Tier 3 (flagship)
Max tokens	1000	1500 (configurable via `rag_full_mode_max_tokens`)
Temperature	0.1	0.1 (configurable via `rag_full_mode_temperature`)
Think Harder button	Visible	Hidden

Implementation

In rag_service.py:

query_stream() checks rag_full_mode from config
When enabled, parameter overrides are applied before the retrieval and generation steps
Think Harder UI button is conditionally hidden via the /api/v1/config endpoint

Consequences

Positive

Best quality by default: Demo audiences see optimal results without manual escalation
Simplified UX: No "Think Harder" decision point for users
Single configuration toggle: Easy to switch between cost-optimized and quality-optimized modes
Reversible: Set rag_full_mode=False to restore standard mode for production cost management

Negative

Higher latency: Jina reranking adds ~500ms per query (BGE local fallback: ~1.5s)
Higher LLM cost per query: Tier 3 (~~$2.00/$8.00 per 1M tokens) vs Tier 2 (~~$0.40/$1.60) -- roughly 5x cost increase per query
More retrieval overhead: 20 candidates with reranking increases processing slightly

Neutral

Retrieval pipeline architecture unchanged (same stages, different parameters)
Safety filtering and medical advice guardrails unchanged
Knowledge graph augmentation unchanged

Alternatives Considered

Alternative 1: Always Use Maximum Settings (No Toggle)

Hardcode the high-quality settings without a feature flag.

Pros: Simpler code (no branching)
Cons: No path to cost optimization for production deployment
Why rejected: Need ability to reduce costs when scaling beyond demo

Alternative 2: Per-Query Quality Selector

Let users choose quality level (fast/balanced/best) per query.

Pros: Maximum flexibility
Cons: Confusing for end users ("why would I not want the best answer?"), adds UI complexity
Why rejected: Hospital search users should not need to understand quality tradeoffs

ADR-0008: User Feedback and Think Harder (original escalation design, now hidden in full mode)
ADR-0015: Taxonomy-Driven Normalization and LLM Optimization (model tier definitions)

Context​

Decision​

Implementation​

Consequences​

Positive​

Negative​

Neutral​

Alternatives Considered​

Alternative 1: Always Use Maximum Settings (No Toggle)​

Alternative 2: Per-Query Quality Selector​

Related ADRs​