
Pipeline Latency

This page documents the end-to-end latency characteristics of the ZOL RAG pipeline (@lewis2020rag), the optimizations applied (ADR-0034), and configuration knobs for tuning. Latency targets are framed against Nielsen's response-time thresholds (0.1 s, 1 s, 10 s) and reported at the tail per @beyer2016sre SRE practice.

Measured baseline (definitive run 2026-03-21, 302-q v3.6)

End-to-end response time across all 302 queries: median 7,829 ms, P90 12,182 ms, P99 20,925 ms (full distribution in thesis Chapter 4, Table 4.3). The per-stage numbers below come from the ADR-0034 optimization sprint and are approximations, not measurements of the current code path. A stage-by-stage re-measurement against the current structured_call / Value Framework code path is still pending; see "Re-measurement open items" at the end of this page.

Pipeline Stage Breakdown

Sequential vs Parallel Stages

| Stage | Duration | Blocking? | Notes |
|---|---|---|---|
| Intent classification | ~1.2s | Yes | Must complete before retrieval |
| User graph preference | ~20-50ms | No | Runs in parallel with intent |
| Vector + BM25 search | ~0.8-1.2s | Yes | Parallel internally |
| Graph search | ~0.3-0.5s | No | Parallel with vector search |
| Cross-encoder reranking | ~0.3-0.5s | Yes | After retrieval |
| Context assembly | ~10-20ms | Yes | After reranking |
| LLM response | 4-8s | Yes | Streaming: TTFT ~1-2s |
| Safety validation | ~5-10ms | Yes | Regex-based, very fast |
| Follow-up suggestions | ~0.5-1s | No | After final chunk (async) |
| Background evaluation | ~2-5s | No | Fire-and-forget task |

Total sequential path: ~6-12s (vs. 14-30s before optimization)
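As a quick sanity check on the total, the blocking stages can be summed directly. This is a minimal sketch that only restates the approximate table values in milliseconds; the dictionary keys are illustrative names, not identifiers from the codebase.

```python
# Blocking stages from the stage table, as (low_ms, high_ms) ranges.
# Keys are illustrative labels, not names taken from the pipeline code.
BLOCKING_STAGES_MS = {
    "intent_classification": (1200, 1200),
    "vector_bm25_search": (800, 1200),
    "cross_encoder_reranking": (300, 500),
    "context_assembly": (10, 20),
    "llm_response": (4000, 8000),
    "safety_validation": (5, 10),
}


def critical_path_ms(stages):
    """Sum the low and high ends of every blocking stage."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low_ms, high_ms = critical_path_ms(BLOCKING_STAGES_MS)
print(f"critical path: {low_ms / 1000:.1f}s to {high_ms / 1000:.1f}s")
```

The sums come to 6,315 ms and 10,930 ms, which sits inside the quoted ~6-12s range. The non-blocking stages are left out because they overlap with a longer blocking stage or run after the response.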

LLM Call Chain

The pipeline makes up to 4 LLM calls per query:

| # | Call | Provider | Model | Latency |
|---|---|---|---|---|
| 1 | Intent classification | OpenAI | gpt-4.1-mini | ~1.2s |
| 2 | Main response | OpenAI | gpt-4.1 | 4-8s |
| 3 | Follow-up suggestions | OpenAI | gpt-4.1-nano | ~0.5-1s |
| 4 | Background evaluation | OpenAI | gpt-4.1-mini | ~2-5s (non-blocking) |

Only calls #1 and #2 are on the critical path. Call #3 runs after the user already sees the response. Call #4 is fire-and-forget.

All LLM calls use the OpenAI direct API (no OpenRouter intermediary). OpenRouter was removed from the codebase on 2026-03-20 after DNS reliability issues caused 28 eval failures in a single run.

Configuration Guide

Provider Routing

# All LLM calls use OpenAI direct API
OPENAI_API_KEY=sk-...
RAG_RESPONSE_PROVIDER=openai
RAG_RESPONSE_MODEL=gpt-4.1
RAG_LLM_PROVIDER=openai
INTENT_CLASSIFICATION_PROVIDER=openai

Streaming

# Enable true token streaming (default: true since ADR-0034)
RAG_TRUE_STREAMING_ENABLED=true

When enabled, tokens stream to the client as they arrive from the LLM. Post-generation safety validation runs after streaming completes; if a violation is detected, a retraction chunk replaces the streamed content.
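The stream-then-validate behavior can be sketched as an async generator. The chunk shapes (`token`, `retraction`, `done`), the example pattern, and the function names are assumptions made for illustration; the page does not show the real chunk format or safety rules.

```python
import asyncio
import re

# Hypothetical rule; the real safety patterns are not shown on this page.
UNSAFE_PATTERNS = [re.compile(r"\btake \d+ ?mg\b", re.IGNORECASE)]


async def stream_with_retraction(token_source):
    """Yield tokens as they arrive, then validate the complete text."""
    seen = []
    async for token in token_source:
        seen.append(token)
        yield {"type": "token", "text": token}
    full_text = "".join(seen)
    if any(p.search(full_text) for p in UNSAFE_PATTERNS):
        # Ask the client to replace everything streamed so far.
        yield {"type": "retraction", "text": "This response was withdrawn."}
    else:
        yield {"type": "done"}


def collect(tokens):
    """Drive the stream to completion from synchronous code."""
    async def _run():
        async def source():
            for t in tokens:
                yield t
        return [chunk async for chunk in stream_with_retraction(source())]
    return asyncio.run(_run())
```

The trade-off in this design is that validation adds nothing to time-to-first-token, at the cost of the client briefly showing text that may later be retracted.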

Fallback Chain

# Ordered fallback: try each provider in sequence
LLM_FALLBACK_CHAIN=[
{"provider": "openai", "model": "gpt-4.1"},
{"provider": "openai", "model": "gpt-4.1"},
{"provider": "ollama", "model": "llama3.2:3b"}
]

# Circuit breaker: 3 failures = skip provider for 60s
LLM_FALLBACK_CIRCUIT_THRESHOLD=3
LLM_FALLBACK_CIRCUIT_RECOVERY_SECONDS=60
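A minimal sketch of how the threshold and recovery settings might interact with the ordered chain is shown here. The `ProviderCircuit` class and `call_with_fallback` function are hypothetical, and keying the breaker by provider name is an assumption; the actual implementation may track state differently.

```python
import time


class ProviderCircuit:
    """Opens after `threshold` consecutive failures, for `recovery_seconds`."""

    def __init__(self, threshold=3, recovery_seconds=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def available(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.recovery_seconds:
            # Recovery window elapsed: close the circuit and allow a new attempt.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

    def record_success(self):
        self.failures = 0
        self.opened_at = None


def call_with_fallback(chain, circuits, call):
    """Try chain entries in order, skipping providers whose circuit is open."""
    last_error = None
    for entry in chain:
        circuit = circuits[entry["provider"]]  # assumption: one breaker per provider
        if not circuit.available():
            continue
        try:
            result = call(entry)
        except Exception as exc:
            circuit.record_failure()
            last_error = exc
            continue
        circuit.record_success()
        return result
    raise RuntimeError("all providers failed or were skipped") from last_error
```

Injecting the clock keeps the 60-second window testable without real waiting.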

Rate Limit Protection

The LLM client automatically retries rate limit errors (HTTP 429) with increasing backoff delays:

  • Attempt 1: wait 2s
  • Attempt 2: wait 5s
  • Attempt 3: wait 10s
  • After 3 retries: raise LLMRateLimitError, triggering fallback chain
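The retry schedule above can be sketched as follows. This assumes one initial call plus three retries, with each wait happening before the next retry; `RateLimited` is a hypothetical marker for an HTTP 429 response, and `LLMRateLimitError` here is a stand-in for the pipeline's exception of the same name.

```python
import time


class LLMRateLimitError(Exception):
    """Stand-in for the pipeline's exception of the same name."""


class RateLimited(Exception):
    """Hypothetical marker for an HTTP 429 response from the provider."""


RETRY_DELAYS_S = (2, 5, 10)


def call_with_rate_limit_retry(call, delays=RETRY_DELAYS_S, sleep=time.sleep):
    """Run `call`; on a 429, wait and retry once per configured delay."""
    for delay in delays:
        try:
            return call()
        except RateLimited:
            sleep(delay)  # wait before the next retry
    try:
        return call()  # final retry, after the last wait
    except RateLimited as exc:
        # Retries exhausted: the caller can hand over to the fallback chain.
        raise LLMRateLimitError("rate limited after retries") from exc
```

Under these assumptions, a query that exhausts every retry spends 17 s waiting before fallback starts, which is worth keeping in mind when reading tail latencies.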

Monitoring

Key metrics to watch:

  • timing.llm_ms in query audit logs: Main LLM response time
  • timing.intent_ms: Intent classification time
  • Fallback events: Logged as warnings when a provider is skipped
  • Rate limit retries: Logged as warnings with attempt count
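To turn these timings into tail figures like the ones reported at the top of the page, a small script along these lines may help. It assumes the audit log is JSON lines with a nested `timing` object, which this page does not confirm, and it uses the nearest-rank percentile definition rather than whatever method produced the published numbers.

```python
import json


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def llm_latencies(log_lines):
    """Collect timing.llm_ms from JSON-lines records, skipping records without it."""
    out = []
    for line in log_lines:
        record = json.loads(line)
        value = record.get("timing", {}).get("llm_ms")
        if value is not None:
            out.append(value)
    return out


def summarize(log_lines):
    """Return median, P90, and P99 of the main LLM response time in ms."""
    values = llm_latencies(log_lines)
    return {name: percentile(values, p)
            for name, p in (("p50", 50), ("p90", 90), ("p99", 99))}
```

The same functions work for `timing.intent_ms` by swapping the key in `llm_latencies`.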

Re-measurement open items

The per-stage numbers above were captured during the ADR-0034 optimization sprint (February 2026). Two pipeline changes have landed since then, and a future re-measurement should reflect both:

  1. structured_call structured-output helper: eight call sites, including intent classification and query decomposition, route through the structured_call helper (app.llm.structured) for schema-validated output with retries. A Pydantic AI Agent pattern was trialed here on 2026-05-09 but removed on 2026-05-12 (commit b8d8da67) after telemetry showed it added ~720 ms per call; returning to the helper restored first-attempt latency to baseline. Rare retries on malformed output add an estimated 400–800 ms but have not been measured at population scale.
  2. Value Framework Stage 5b + synthetic doctor-list Stage 5c: Stage 5b adds approximately 2 ms (in-memory matrix multiply); Stage 5c adds approximately 5–10 ms (single SQL query) and fires only when intent=doctor_lookup and a department is detected.

Neither item moves the headline median materially: item 1 adds latency only on rare retry paths, and item 2 adds at most about 12 ms (roughly 2 ms for Stage 5b plus up to 10 ms for Stage 5c). A clean re-measurement at P95/P99 would still close the loop.
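For readers unfamiliar with the retry cost mentioned in item 1, the general validate-and-retry pattern looks roughly like this. It is a generic sketch using only the standard library, not the structured_call helper itself: `validated_call`, `MalformedOutput`, and the required-keys check are hypothetical stand-ins for real schema validation.

```python
import json


class MalformedOutput(Exception):
    """Raised when no attempt produced output that passes validation."""


def validated_call(generate, required_keys, max_attempts=2):
    """Call `generate` until it returns a JSON object with all required keys.

    Returns (data, attempts_used) so callers can log how often retries occur.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        raw = generate()  # each extra attempt costs one more model round trip
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = [k for k in required_keys if k not in data]
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return data, attempt
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = exc
    raise MalformedOutput(f"gave up after {max_attempts} attempts") from last_error
```

Returning the attempt count alongside the data is one way to collect the population-scale retry statistics that are currently missing.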