Pipeline Latency

This page documents the end-to-end latency characteristics of the ZOL RAG pipeline (@lewis2020rag), the optimizations applied (ADR-0034), and configuration knobs for tuning. Latency targets are framed against Nielsen's response-time thresholds (0.1 s, 1 s, 10 s) and reported at the tail per @beyer2016sre SRE practice.

Measured baseline (definitive run 2026-03-21, 302-q v3.6)

End-to-end response time across all 302 queries: median 7,829 ms, P90 12,182 ms, P99 20,925 ms (full distribution in thesis Chapter 4, Table 4.3). The per-stage numbers below date from the ADR-0034 optimization sprint and are approximations. A stage-by-stage re-measurement against the current structured_call / Value Framework code path is still outstanding; see the "Re-measurement open items" callout at the end of this page.

Pipeline Stage Breakdown

Sequential vs Parallel Stages

Stage	Duration	Blocking?	Notes
Intent classification	~1.2s	Yes	Must complete before retrieval
User graph preference	~20-50ms	No	Runs in parallel with intent
Vector + BM25 search	~0.8-1.2s	Yes	Parallel internally
Graph search	~0.3-0.5s	No	Parallel with vector search
Cross-encoder reranking	~0.3-0.5s	Yes	After retrieval
Context assembly	~10-20ms	Yes	After reranking
LLM response	4-8s	Yes	Streaming: TTFT ~1-2s
Safety validation	~5-10ms	Yes	Regex-based, very fast
Follow-up suggestions	~0.5-1s	No	After final chunk (async)
Background evaluation	~2-5s	No	Fire-and-forget task

Total sequential path: ~6-12s (vs. 14-30s before optimization)
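
As a sanity check on the total, the blocking stages in the table can be summed directly. The sketch below is illustrative only: the stage keys are made-up names, and point estimates such as "~1.2s" are entered as a single value for both ends of the range. Non-blocking stages are left out because they overlap with blocking ones.

```python
# Blocking stages from the table above, as (low, high) durations in seconds.
# The dictionary keys are illustrative names, not identifiers from the codebase.
BLOCKING_STAGES = {
    "intent_classification": (1.2, 1.2),
    "vector_bm25_search": (0.8, 1.2),
    "cross_encoder_reranking": (0.3, 0.5),
    "context_assembly": (0.010, 0.020),
    "llm_response": (4.0, 8.0),
    "safety_validation": (0.005, 0.010),
}


def critical_path_bounds(stages):
    """Sum the low ends and the high ends of every blocking stage."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = critical_path_bounds(BLOCKING_STAGES)
print(f"critical path: {low:.2f}s to {high:.2f}s")
```

The sum lands at roughly 6.3 s to 10.9 s, which sits inside the quoted ~6-12s envelope; the LLM response dominates both ends of the range.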

LLM Call Chain

The pipeline makes up to 4 LLM calls per query:

#	Call	Provider	Model	Latency
1	Intent classification	OpenAI	gpt-4.1-mini	~1.2s
2	Main response	OpenAI	gpt-4.1	4-8s
3	Follow-up suggestions	OpenAI	gpt-4.1-nano	~0.5-1s
4	Background evaluation	OpenAI	gpt-4.1-mini	~2-5s (non-blocking)

Only calls #1 and #2 are on the critical path. Call #3 runs after the user already sees the response. Call #4 is fire-and-forget.

All LLM calls use the OpenAI direct API (no OpenRouter intermediary). OpenRouter was removed from the codebase on 2026-03-20 after DNS reliability issues caused 28 eval failures in a single run.

Configuration Guide

Provider Routing

# All LLM calls use OpenAI direct API
OPENAI_API_KEY=sk-...
RAG_RESPONSE_PROVIDER=openai
RAG_RESPONSE_MODEL=gpt-4.1
RAG_LLM_PROVIDER=openai
INTENT_CLASSIFICATION_PROVIDER=openai

Streaming

# Enable true token streaming (default: true since ADR-0034)
RAG_TRUE_STREAMING_ENABLED=true

When enabled, tokens stream to the client as they arrive from the LLM. Post-generation safety validation runs after streaming completes; if a violation is detected, a retraction chunk replaces the streamed content.
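
The stream-then-validate behaviour can be sketched as an async generator. Everything here is an assumption made for illustration: the chunk shape (`type`/`text` dictionaries), the `UNSAFE` pattern, and the function names are hypothetical, and the real safety rules are not listed on this page.

```python
import asyncio
import re
from typing import AsyncIterator

# Hypothetical rule standing in for the pipeline's regex-based safety checks.
UNSAFE = re.compile(r"forbidden phrase", re.IGNORECASE)


async def fake_token_stream(tokens) -> AsyncIterator[str]:
    """Stand-in for tokens arriving from the LLM."""
    for token in tokens:
        await asyncio.sleep(0)
        yield token


async def stream_with_validation(tokens) -> AsyncIterator[dict]:
    buffer = []
    async for token in tokens:
        buffer.append(token)
        # Forward each token immediately; nothing is held back for validation.
        yield {"type": "token", "text": token}
    # Validation sees the full text only once streaming has finished.
    if UNSAFE.search("".join(buffer)):
        yield {"type": "retraction", "text": "This response was withdrawn."}
    else:
        yield {"type": "done"}


async def collect(tokens):
    stream = stream_with_validation(fake_token_stream(tokens))
    return [chunk async for chunk in stream]


safe_chunks = asyncio.run(collect(["All ", "clear."]))
unsafe_chunks = asyncio.run(collect(["A forbidden ", "phrase here."]))
```

The trade-off in this design is that the user may briefly see content that is later retracted, in exchange for a time to first token of ~1-2s instead of waiting for the full response.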

Fallback Chain

# Ordered fallback: try each provider in sequence
LLM_FALLBACK_CHAIN=[
  {"provider": "openai", "model": "gpt-4.1"},
  {"provider": "openai", "model": "gpt-4.1"},
  {"provider": "ollama", "model": "llama3.2:3b"}
]

# Circuit breaker: 3 failures = skip provider for 60s
LLM_FALLBACK_CIRCUIT_THRESHOLD=3
LLM_FALLBACK_CIRCUIT_RECOVERY_SECONDS=60
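
The interaction between the ordered chain and the circuit breaker can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: `CircuitBreaker` and `call_with_fallback` are hypothetical names, breakers are keyed per provider, and a breaker closes fully once the recovery window has elapsed.

```python
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, closes after the recovery window."""

    def __init__(self, threshold=3, recovery_seconds=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def available(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.recovery_seconds:
            # Recovery window elapsed: close the circuit and allow a new attempt.
            self.failures = 0
            self.opened_at = None
            return True
        return False

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

    def record_success(self):
        self.failures = 0
        self.opened_at = None


def call_with_fallback(chain, breakers, call):
    """Try each chain entry in order, skipping providers whose circuit is open."""
    for entry in chain:
        breaker = breakers[entry["provider"]]
        if not breaker.available():
            continue
        try:
            result = call(entry)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return result
    raise RuntimeError("all providers in the fallback chain failed")


# Demo with a fake clock and a simulated OpenAI outage.
fake_now = [0.0]
chain = [
    {"provider": "openai", "model": "gpt-4.1"},
    {"provider": "ollama", "model": "llama3.2:3b"},
]
breakers = {p: CircuitBreaker(clock=lambda: fake_now[0]) for p in ("openai", "ollama")}


def flaky(entry):
    if entry["provider"] == "openai":
        raise ConnectionError("simulated outage")
    return f"answered by {entry['provider']}"


results = [call_with_fallback(chain, breakers, flaky) for _ in range(3)]
open_after_three = not breakers["openai"].available()
fake_now[0] = 60.0
closed_after_recovery = breakers["openai"].available()
```

Injecting the clock keeps the sketch testable without waiting 60 seconds; a production breaker would use the default monotonic clock.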

Rate Limit Protection

The LLM client automatically retries rate limit errors (HTTP 429) with increasing backoff delays:

Attempt 1: wait 2s
Attempt 2: wait 5s
Attempt 3: wait 10s
After 3 retries: raise LLMRateLimitError, triggering fallback chain
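
The schedule above can be sketched as a small retry loop. This reflects one reading of the list (an initial request followed by up to three retries, each preceded by its wait). `RateLimited` and `call_with_rate_limit_retry` are hypothetical names, and `LLMRateLimitError` is redefined locally so the example runs on its own.

```python
import time


class LLMRateLimitError(Exception):
    """Local stand-in for the pipeline's error of the same name."""


class RateLimited(Exception):
    """Hypothetical exception representing an HTTP 429 from the provider."""


BACKOFF_SECONDS = (2, 5, 10)


def call_with_rate_limit_retry(request, sleep=time.sleep):
    # Each failed request is followed by the next wait in the schedule.
    for delay in BACKOFF_SECONDS:
        try:
            return request()
        except RateLimited:
            sleep(delay)
    # Final retry after the last wait; a failure here exhausts the budget.
    try:
        return request()
    except RateLimited as exc:
        raise LLMRateLimitError("rate limited after 3 retries") from exc


# Demo: record the waits instead of actually sleeping.
waits = []


def always_429():
    raise RateLimited()


try:
    call_with_rate_limit_retry(always_429, sleep=waits.append)
    exhausted = False
except LLMRateLimitError:
    exhausted = True
```

In the worst case this adds 17 s of waiting before the fallback chain takes over, which is worth keeping in mind when reading tail latencies.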

Monitoring

Key metrics to watch:

timing.llm_ms in query audit logs: Main LLM response time
timing.intent_ms: Intent classification time
Fallback events: Logged as warnings when a provider is skipped
Rate limit retries: Logged as warnings with attempt count
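
Tail percentiles for these timings can be derived from the audit logs with a few lines of Python. The sketch assumes the logs are JSON lines with a nested `timing` object, which this page does not confirm; the sample values are synthetic and the function names are hypothetical.

```python
import json
import math


def nearest_rank_percentile(values, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def llm_ms_from_audit_lines(lines):
    """Extract timing.llm_ms from JSON-lines audit records that carry it."""
    out = []
    for line in lines:
        record = json.loads(line)
        value = record.get("timing", {}).get("llm_ms")
        if value is not None:
            out.append(value)
    return out


# Synthetic sample records, not measured data.
sample_lines = [
    json.dumps({"timing": {"llm_ms": ms, "intent_ms": 1200}})
    for ms in (4100, 5200, 4800, 7900, 6100)
]
llm_ms = llm_ms_from_audit_lines(sample_lines)
p50 = nearest_rank_percentile(llm_ms, 50)
p90 = nearest_rank_percentile(llm_ms, 90)
```

Nearest-rank is used here because it always returns an observed value; interpolating methods give slightly different numbers on small samples.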

Re-measurement open items

The per-stage numbers above were captured during the ADR-0034 optimization sprint (February 2026). Two pipeline changes have landed since then that a future re-measurement should reflect:

structured_call structured-output helper: eight call sites, including intent classification and query decomposition, route through the structured_call helper (app.llm.structured) for schema-validated output with retries. A Pydantic AI Agent pattern was trialed here on 2026-05-09 but removed on 2026-05-12 (commit b8d8da67) after telemetry showed it added ~720 ms per call; the helper restored first-attempt latency to baseline. Rare retries on malformed output add an estimated 400–800 ms but have not been measured at population scale.
Value Framework Stage 5b + synthetic doctor-list Stage 5c: Stage 5b adds approximately 2 ms (in-memory matrix multiply); Stage 5c adds approximately 5–10 ms (single SQL query) and only fires when intent=doctor_lookup and a department is detected.

Neither item moves the headline median materially, since both are bounded by single-digit-millisecond CPU work or rare retry paths, but a clean re-measurement at P95/P99 would close the loop.
