Pipeline Latency
This page documents the end-to-end latency characteristics of the ZOL RAG pipeline (@lewis2020rag), the optimizations applied (ADR-0034), and configuration knobs for tuning. Latency targets are framed against Nielsen's response-time thresholds (0.1 s, 1 s, 10 s) and reported at the tail per @beyer2016sre SRE practice.
End-to-end response time across all 302 queries: median 7,829 ms, P90 12,182 ms, P99 20,925 ms (full distribution in thesis Chapter 4, Table 4.3). The per-stage numbers below are from the ADR-0034 optimization sprint and are honest approximations — for a stage-by-stage re-measurement against the current structured_call / Value Framework code path, see the "Re-measurement open items" callout at the end of this page.
Pipeline Stage Breakdown
Sequential vs Parallel Stages
| Stage | Duration | Blocking? | Notes |
|---|---|---|---|
| Intent classification | ~1.2s | Yes | Must complete before retrieval |
| User graph preference | ~20-50ms | No | Runs in parallel with intent |
| Vector + BM25 search | ~0.8-1.2s | Yes | Parallel internally |
| Graph search | ~0.3-0.5s | No | Parallel with vector search |
| Cross-encoder reranking | ~0.3-0.5s | Yes | After retrieval |
| Context assembly | ~10-20ms | Yes | After reranking |
| LLM response | 4-8s | Yes | Streaming: TTFT ~1-2s |
| Safety validation | ~5-10ms | Yes | Regex-based, very fast |
| Follow-up suggestions | ~0.5-1s | No | After final chunk (async) |
| Background evaluation | ~2-5s | No | Fire-and-forget task |
Total sequential path: ~6-12s (vs. 14-30s before optimization)
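As a sanity check on the total, the blocking stages from the table can be summed directly. The sketch below hard-codes the table's approximate ranges (treating ~1.2s as a point value); the names `BLOCKING_STAGES` and `critical_path_bounds` are illustrative and do not come from the codebase.

```python
# Approximate (low, high) durations in seconds for the stages marked
# "Blocking? Yes" in the table above. Values are copied from the table;
# the dictionary keys and function name are illustrative only.
BLOCKING_STAGES = {
    "intent_classification": (1.2, 1.2),
    "vector_bm25_search": (0.8, 1.2),
    "cross_encoder_reranking": (0.3, 0.5),
    "context_assembly": (0.010, 0.020),
    "llm_response": (4.0, 8.0),
    "safety_validation": (0.005, 0.010),
}


def critical_path_bounds(stages):
    """Sum the low and high ends of each blocking stage's range."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = critical_path_bounds(BLOCKING_STAGES)
print(f"critical path: {low:.1f}s to {high:.1f}s")
```

The sum comes out at roughly 6.3 s to 10.9 s, which is consistent with the ~6-12s figure once rounding and per-query variance are allowed for.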
LLM Call Chain
The pipeline makes up to 4 LLM calls per query:
| # | Call | Provider | Model | Latency |
|---|---|---|---|---|
| 1 | Intent classification | OpenAI | gpt-4.1-mini | ~1.2s |
| 2 | Main response | OpenAI | gpt-4.1 | 4-8s |
| 3 | Follow-up suggestions | OpenAI | gpt-4.1-nano | ~0.5-1s |
| 4 | Background evaluation | OpenAI | gpt-4.1-mini | ~2-5s (non-blocking) |
Only calls #1 and #2 are on the critical path. Call #3 runs after the user already sees the response. Call #4 is fire-and-forget.
All LLM calls use the OpenAI direct API (no OpenRouter intermediary). OpenRouter was removed from the codebase on 2026-03-20 after DNS reliability issues caused 28 eval failures in a single run.
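The critical-path split described above can be sketched with `asyncio`: calls #1 and #2 are awaited in sequence, while calls #3 and #4 are scheduled as tasks after the response has been delivered. This is a minimal illustration, not the pipeline's actual code; `call_llm`, `handle_query`, and the `events` list are hypothetical stand-ins.

```python
import asyncio

events = []  # records completion order, for illustration only


async def call_llm(name: str) -> str:
    """Stand-in for a real LLM call: yields control once, then records completion."""
    await asyncio.sleep(0)
    events.append(name)
    return name


async def handle_query(query: str):
    # Critical path: call #1, then call #2, awaited in sequence.
    await call_llm("intent")
    response = await call_llm("response")
    events.append("response_delivered")
    # Off the critical path: scheduled here, not awaited before returning.
    followups = asyncio.create_task(call_llm("followups"))
    evaluation = asyncio.create_task(call_llm("evaluation"))
    return response, followups, evaluation


async def main():
    _, followups, evaluation = await handle_query("example")
    snapshot = list(events)  # what had finished when the response was delivered
    # The tasks are awaited here only so the demo exits cleanly.
    await asyncio.gather(followups, evaluation)
    return snapshot


delivered_before = asyncio.run(main())
print(delivered_before)
```

In a long-running service the task references would normally be kept somewhere (for example in a set) until they finish, since the event loop holds only weak references to tasks.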
Configuration Guide
Provider Routing
# All LLM calls use OpenAI direct API
OPENAI_API_KEY=sk-...
RAG_RESPONSE_PROVIDER=openai
RAG_RESPONSE_MODEL=gpt-4.1
RAG_LLM_PROVIDER=openai
INTENT_CLASSIFICATION_PROVIDER=openai
Streaming
# Enable true token streaming (default: true since ADR-0034)
RAG_TRUE_STREAMING_ENABLED=true
When enabled, tokens stream to the client as they arrive from the LLM. Post-generation safety validation runs after streaming completes; if a violation is detected, a retraction chunk replaces the streamed content.
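A rough sketch of the stream-then-validate behavior described above, under stated assumptions: the chunk shapes (`token`, `retraction`, `done`), the `BLOCKED` pattern, and the function name are all hypothetical, since this page does not show the real chunk protocol or the validator's rules.

```python
import re
from typing import Iterable, Iterator

# Hypothetical rule; the real validator's patterns are not shown on this page.
BLOCKED = re.compile(r"\bforbidden\b", re.IGNORECASE)


def stream_with_validation(tokens: Iterable[str]) -> Iterator[dict]:
    """Yield token chunks as they arrive, then validate the full text."""
    buffer = []
    for token in tokens:
        buffer.append(token)
        yield {"type": "token", "text": token}
    full_text = "".join(buffer)
    if BLOCKED.search(full_text):
        # Ask the client to replace everything streamed so far.
        yield {"type": "retraction", "text": "This response was withdrawn."}
    else:
        yield {"type": "done"}


safe = list(stream_with_validation(["Hello", " ", "world"]))
unsafe = list(stream_with_validation(["This is ", "forbidden", " text"]))
```

The trade-off is that a violating response is briefly visible before the retraction arrives, in exchange for a much lower time to first token.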
Fallback Chain
# Ordered fallback: try each provider in sequence
LLM_FALLBACK_CHAIN=[
{"provider": "openai", "model": "gpt-4.1"},
{"provider": "openai", "model": "gpt-4.1"},
{"provider": "ollama", "model": "llama3.2:3b"}
]
# Circuit breaker: 3 failures = skip provider for 60s
LLM_FALLBACK_CIRCUIT_THRESHOLD=3
LLM_FALLBACK_CIRCUIT_RECOVERY_SECONDS=60
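One way to read these settings is sketched below: a per-provider circuit breaker that opens after 3 failures and closes again after 60 seconds, wrapped by a loop over the fallback chain. The class and function names are hypothetical, and two details are assumptions rather than documented behavior: failures are counted consecutively (a success resets the count), and breakers are keyed by provider rather than by chain entry.

```python
import time


class CircuitBreaker:
    """Skips a provider for `recovery_seconds` after `threshold` failures."""

    def __init__(self, threshold=3, recovery_seconds=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock  # injectable so tests need not wait in real time
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.recovery_seconds:
            # Recovery window elapsed: close the circuit and try again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None


def call_with_fallback(chain, breakers, call):
    """Try each chain entry in order, skipping any whose circuit is open."""
    for entry in chain:
        breaker = breakers[entry["provider"]]
        if not breaker.allow():
            continue
        try:
            result = call(entry)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return result
    raise RuntimeError("all providers in the fallback chain failed or were skipped")
```

Here `call` is any function that takes a chain entry such as `{"provider": "openai", "model": "gpt-4.1"}` and either returns a response or raises.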
Rate Limit Protection
The LLM client automatically retries rate limit errors (HTTP 429) with increasing backoff delays:
- Attempt 1: wait 2s
- Attempt 2: wait 5s
- Attempt 3: wait 10s
- After 3 retries: raise LLMRateLimitError, triggering fallback chain
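The retry schedule above could be implemented along the following lines. This is one possible reading (an initial call plus three retries, waiting 2 s, 5 s, and 10 s before each retry), not the client's actual code. `RateLimited` is a hypothetical stand-in for the provider's HTTP 429 error, and `call_with_retries` is an illustrative name; only `LLMRateLimitError` is taken from this page.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the provider's HTTP 429 error."""


class LLMRateLimitError(Exception):
    """Raised once all retries are exhausted (name taken from this page)."""


BACKOFF_SECONDS = (2, 5, 10)  # wait before retry 1, 2, and 3


def call_with_retries(request, sleep=time.sleep):
    """Call `request`, retrying on RateLimited with the fixed delay schedule."""
    for delay in BACKOFF_SECONDS:
        try:
            return request()
        except RateLimited:
            sleep(delay)
    # Third and final retry: a failure here is surfaced to the caller,
    # which can then move on to the next entry in the fallback chain.
    try:
        return request()
    except RateLimited as exc:
        raise LLMRateLimitError("rate limited after 3 retries") from exc
```

Passing `sleep` as a parameter keeps the function testable without real delays. In the worst case this schedule adds 17 s of waiting before the fallback chain is triggered.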
Monitoring
Key metrics to watch:
- timing.llm_ms in query audit logs: Main LLM response time
- timing.intent_ms: Intent classification time
- Fallback events: Logged as warnings when a provider is skipped
- Rate limit retries: Logged as warnings with attempt count
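For tail-latency tracking, the timing fields can be reduced to percentiles. The sketch below assumes each audit record is a dictionary with a nested `timing` object, which is a guess based on the dotted metric names; the sample values are made up for illustration and are not measurements.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up sample records; the real audit log schema may differ.
records = [
    {"timing": {"llm_ms": ms, "intent_ms": 1200}}
    for ms in (4100, 5200, 4800, 7900, 6100)
]

llm_ms = [r["timing"]["llm_ms"] for r in records]
p50 = percentile(llm_ms, 50)
p90 = percentile(llm_ms, 90)
print(f"llm_ms p50={p50} p90={p90}")
```

With only a handful of samples the upper percentiles collapse onto the maximum, so tail figures are only meaningful over a reasonably large window of queries.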
Re-measurement open items
The per-stage numbers above were captured during the ADR-0034 optimization sprint (February 2026). Two pipeline changes have landed since then that a future re-measurement should reflect:
- structured_call structured-output helper: eight call sites, including intent classification and query decomposition, route through the structured_call helper (app.llm.structured) for schema-validated output with retries. A Pydantic AI Agent pattern was trialed here on 2026-05-09 but removed on 2026-05-12 (commit b8d8da67) after telemetry showed it added ~720 ms per call; the helper restored first-attempt latency to baseline. Rare retries on malformed output add an estimated 400–800 ms but have not been measured at population scale.
- Value Framework Stage 5b + synthetic doctor-list Stage 5c: Stage 5b adds approximately 2 ms (in-memory matrix multiply); Stage 5c adds approximately 5–10 ms (single SQL query) and only fires when intent=doctor_lookup and a department is detected.
Neither item moves the headline median materially — both are bounded by single-digit-millisecond CPU work or rare retry paths — but a clean re-measurement at P95/P99 would close the loop.