Performance and Metrics

Performance in a RAG system is multi-dimensional (@lewis2020rag, @gao2024ragsurvey). Latency, cost, and quality form a three-way trade-off where optimizing one often degrades another. The ZOL Intelligent Search navigates this trade-off through careful model selection, caching, and the hybrid evaluation strategy. Tail-latency measurement and SLO framing follow @beyer2016sre; the underlying response-time UX targets follow @nielsen1993responsetimes (0.1 s for instantaneous feedback, 1 s for seamless flow, 10 s as the upper attention bound).

Source-of-truth for measured numbers

The waterfall below describes the design-time stage budget from the ADR-0034 optimization sprint (February 2026). The current measured end-to-end distribution across the 302-question golden set v3.6 is median 7,829 ms, P90 12,182 ms, P99 20,925 ms (thesis Chapter 4, Table 4.3). The ~5,500 ms total in the waterfall is consistent with the lower part of the measured distribution; the upper tail reflects long-form generation and follow-up-chain queries.

Response Time Breakdown

The design-time budget for the end-to-end query pipeline totals approximately 5.5 seconds (P50). The following waterfall shows where that time is spent:

Time Budget Allocation

| Stage | Time | % of Total | Optimization |
| --- | --- | --- | --- |
| Cache check | 5 ms | < 0.1% | PostgreSQL semantic cache (SHA-256 hash, Tier 1) |
| Intent classification + query rewriting | 400 ms | 7% | Tier 2 (combined in one LLM call) |
| Metadata filtering | 5 ms | < 0.1% | In-memory lookup |
| Vector search | 600 ms | 11% | HNSW index |
| BM25 search | 600 ms | (parallel) | PostgreSQL tsvector |
| Taxonomy search | 600 ms | (parallel) | SQL query optimization |
| Metadata boosting | 5 ms | < 0.1% | In-memory computation |
| Context assembly | 50 ms | 1% | Expand ±1, dedup, budget |
| Response generation | 3,500 ms | 64% | Tier 2 / Tier 3 full mode |
| Fast quality gate | 600 ms | 11% | Embedding similarity |
| Streaming overhead | 85 ms | 1% | WebSocket |
| Total | ~5,500 ms | 100% | |

Key Insight

Response generation dominates the pipeline at 64% of total time. This is by design: the generation model (Tier 2 / Tier 3 in full mode) is the highest-quality model in the stack, and the streaming interface masks perceived latency. Users see the response appearing word-by-word after ~4.5 seconds, rather than waiting the full 5.5 seconds for a complete response. Intent classification and query rewriting are combined into a single LLM call (~400 ms), which saves ~300 ms compared to separate calls.
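The budget arithmetic can be checked with a short sketch. The stage durations are copied from the Time Budget Allocation table; the grouping into "before", "parallel", and "after" stages and all names are assumptions made for illustration, not identifiers from the codebase.

```python
# Stage durations in milliseconds, copied from the Time Budget Allocation table.
# The grouping and the names below are illustrative assumptions.
BEFORE_SEARCH = {"cache_check": 5, "intent_and_rewrite": 400, "metadata_filtering": 5}
PARALLEL_SEARCH = {"vector": 600, "bm25": 600, "taxonomy": 600}
AFTER_SEARCH = {
    "metadata_boosting": 5,
    "context_assembly": 50,
    "response_generation": 3500,
    "fast_quality_gate": 600,
    "streaming_overhead": 85,
}


def critical_path_ms() -> int:
    """Parallel searches contribute only their slowest member to the total."""
    return (
        sum(BEFORE_SEARCH.values())
        + max(PARALLEL_SEARCH.values())
        + sum(AFTER_SEARCH.values())
    )


def naive_serial_ms() -> int:
    """Total if the three searches ran one after another instead."""
    return (
        sum(BEFORE_SEARCH.values())
        + sum(PARALLEL_SEARCH.values())
        + sum(AFTER_SEARCH.values())
    )


def generation_share() -> float:
    """Fraction of the critical path spent in response generation."""
    return AFTER_SEARCH["response_generation"] / critical_path_ms()


if __name__ == "__main__":
    print(critical_path_ms(), naive_serial_ms(), round(generation_share(), 3))
```

The listed stages sum to 5,250 ms on the critical path, which the table rounds up to ~5,500 ms. Running the three searches in parallel removes 1,200 ms relative to a serial layout.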

LLM Cost Optimization

The multi-model strategy minimizes cost by using the right model for each task:

| Task | Model | Cost per 1M tokens | Monthly Volume | Monthly Cost |
| --- | --- | --- | --- | --- |
| Intent classification + query rewriting | Tier 2 | ~$0.40 (input) | ~25K queries | ~$0.50 |
| Response generation (standard) | Tier 2 | ~$0.40 (input) | ~15K queries | ~$3.00 |
| Response generation (full mode) | Tier 3 | ~$2.00 (input) | ~15K queries | ~$7.50 |
| Background evaluation | Tier 2 | ~$0.40 (input) | ~15K queries | ~$0.30 |
| Embeddings | OpenAI text-embedding-3-large (1536d, hosted, ADR-0048) | $0.13 | All queries (~50 tokens/query × 25K/month ≈ 1.25M tokens/month) | ≈$0.16 |
| Entity extraction | Regex | $0 (local) | All documents | $0 |
| Graph entity validation + page summaries | Tier 2 | ~$0.40 (input) | ~2,000 pages/run | ~$1-2/run* |
| Total | | | | ~$8.70/month |

*Graph validation is an ingestion-time cost, not per-query. It runs once per full corpus extraction (~2,000 pages), producing validated entities for the taxonomy and page summaries for contextual retrieval. A cross-page entity cache reduces LLM calls by 10-25%. See ADR-0014.

Caching Impact

The ~$8.70 estimate assumes a 40% cache hit rate, which avoids all LLM calls for repeated queries. Without caching, costs would be approximately $14/month. The PostgreSQL-based semantic query cache (ADR-0031, 1-hour TTL) provides a significant return on its minimal infrastructure cost through two-tier lookup (SHA-256 hash + embedding similarity).
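The two figures that follow directly from stated inputs can be reproduced with a few lines of arithmetic. This is a sketch under a simplifying assumption: per-query LLM cost scales linearly with the share of queries that miss the cache. The function names are hypothetical.

```python
EMBEDDING_PRICE_PER_M_TOKENS = 0.13  # USD, from the cost table


def embedding_cost_per_month(tokens_per_query: int = 50,
                             queries_per_month: int = 25_000) -> float:
    """Monthly embedding spend for query embeddings only."""
    tokens = tokens_per_query * queries_per_month  # 1.25M tokens per month
    return tokens / 1_000_000 * EMBEDDING_PRICE_PER_M_TOKENS


def cost_without_cache(cost_with_cache: float, hit_rate: float) -> float:
    """Scale a cached-cost estimate back up to the uncached case.

    Assumes cost is proportional to the fraction of queries that miss
    the cache, which is a simplification for illustration.
    """
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return cost_with_cache / (1.0 - hit_rate)


if __name__ == "__main__":
    print(f"embeddings: ${embedding_cost_per_month():.4f}/month")
    print(f"no cache:   ${cost_without_cache(8.70, 0.40):.2f}/month")
```

Under that linear assumption, $8.70 at a 40% hit rate corresponds to about $14.50 without caching, in line with the approximately $14/month quoted above.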

The Hybrid Evaluation Breakthrough

The most significant performance decision was the shift from full inline evaluation to hybrid evaluation, which pairs a fast inline gate with background analytics.

This architectural change reduced perceived latency by 88-91% while maintaining quality assurance through the fast embedding-similarity gate and providing comprehensive analytics via the asynchronous background evaluation.
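The split between an inline gate and deferred analytics can be sketched with `asyncio`. Everything here is illustrative: the threshold value, the choice to compare an answer embedding against a context embedding, and the function names are assumptions, and the background coroutine is a stand-in for the slower LLM-based evaluation.

```python
import asyncio
import math

GATE_THRESHOLD = 0.5  # hypothetical value, not taken from the source

background_results: list[dict] = []  # stand-in for an analytics store


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


async def background_evaluation(query: str, answer: str) -> None:
    """Placeholder for slow metrics (faithfulness, relevancy)."""
    await asyncio.sleep(0)  # yields control; a real evaluation would take seconds
    background_results.append({"query": query, "answer": answer})


async def answer_with_hybrid_eval(query, answer, answer_vec, context_vec):
    """Run the fast gate inline and schedule the slow evaluation."""
    score = cosine(answer_vec, context_vec)
    passed = score >= GATE_THRESHOLD
    # The caller gets its result without awaiting the background task.
    # Returning the task keeps a reference so it is not garbage-collected.
    task = asyncio.create_task(background_evaluation(query, answer))
    return passed, score, task


async def main() -> None:
    passed, score, task = await answer_with_hybrid_eval(
        "q", "a", [1.0, 0.0], [1.0, 0.0]
    )
    print(passed, round(score, 3))
    await task  # only so the demo exits cleanly


if __name__ == "__main__":
    asyncio.run(main())
```

Only the cheap similarity check sits on the user-facing path. The expensive evaluation runs after the response has been handed back.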

Prometheus Metrics

The system exports metrics to Prometheus for monitoring and alerting:

Request Metrics

| Metric | Type | Description |
| --- | --- | --- |
| query_requests_total | Counter | Total query count by intent type |
| query_latency_seconds | Histogram | End-to-end query latency (P50, P95, P99) |
| cache_hit_ratio | Gauge | Percentage of queries served from cache |
| retrieval_results_count | Histogram | Number of results returned per query |

Quality Metrics

| Metric | Type | Description |
| --- | --- | --- |
| quality_gate_score | Histogram | Fast gate similarity scores |
| quality_gate_pass_ratio | Gauge | Percentage of responses passing the gate |
| deepeval_faithfulness | Histogram | Background faithfulness scores |
| deepeval_relevancy | Histogram | Background relevancy scores |
| feedback_positive_ratio | Gauge | User thumbs-up percentage |

Safety Metrics

| Metric | Type | Description |
| --- | --- | --- |
| safety_blocks_total | Counter | Intent-based safety blocks |
| safety_validation_blocks | Counter | Post-generation safety blocks |
| medical_advice_incidents | Counter | Target: permanently at ZERO |
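A few of the metrics above could be declared with the `prometheus_client` Python library roughly as follows. This is a sketch, not the project's instrumentation: the `intent` label name, the bucket boundaries, and the `record_query` helper are assumptions. Histograms store bucket counts, so percentiles such as P95 are derived at query time (for example with PromQL's `histogram_quantile`).

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# A private registry keeps this sketch isolated from the process-wide default.
registry = CollectorRegistry()

QUERY_REQUESTS = Counter(
    "query_requests_total",
    "Total query count by intent type",
    ["intent"],  # label name is an assumption
    registry=registry,
)

QUERY_LATENCY = Histogram(
    "query_latency_seconds",
    "End-to-end query latency",
    # Bucket edges are illustrative, chosen around the 10 s and 15 s targets.
    buckets=(0.5, 1, 2.5, 5, 7.5, 10, 15, 20, 30),
    registry=registry,
)

CACHE_HIT_RATIO = Gauge(
    "cache_hit_ratio",
    "Share of queries served from cache",
    registry=registry,
)


def record_query(intent: str, latency_seconds: float) -> None:
    """Hypothetical helper: update request count and latency for one query."""
    QUERY_REQUESTS.labels(intent=intent).inc()
    QUERY_LATENCY.observe(latency_seconds)


if __name__ == "__main__":
    record_query("faq", 7.8)
    CACHE_HIT_RATIO.set(0.40)
    print(registry.get_sample_value("query_requests_total", {"intent": "faq"}))
```

Note that the Python client exposes counters with a `_total` suffix, so a counter declared as `medical_advice_incidents` would appear as `medical_advice_incidents_total` when scraped.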

Caching Strategy

The semantic query cache runs in PostgreSQL via pgvector (ADR-0031). It provides a two-tier lookup keyed on LLM-reformulated queries, which maximizes hit rates across languages:

| Cache Level | Mechanism | Latency | Hit Rate |
| --- | --- | --- | --- |
| Tier 1: Exact hash | SHA-256 of reformulated query | ~1 ms | ~30% |
| Tier 2: Semantic similarity | pgvector HNSW cosine similarity >= 0.97 | ~30 ms | ~10-15% additional |
| Session context (Redis) | In-memory conversation history | < 1 ms | Per-user |

The semantic cache was migrated from Redis to PostgreSQL (ADR-0031) because Tier 2 approximate nearest neighbor search requires pgvector's HNSW index. Redis continues to serve ephemeral state (sessions, rate limits, token blacklist). See Storage Architecture for schema details.
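The two-tier lookup order can be illustrated with an in-memory sketch. It is not the PostgreSQL implementation: a linear scan stands in for the pgvector HNSW index, the whitespace and case normalization before hashing is an assumption, and the class and method names are hypothetical. Only the SHA-256 key and the 0.97 cosine threshold come from the table above.

```python
import hashlib
import math

SIMILARITY_THRESHOLD = 0.97  # from the Tier 2 row above


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    """Exact-hash lookup first, then nearest-neighbour fallback."""

    def __init__(self) -> None:
        self._by_hash: dict[str, str] = {}
        self._entries: list[tuple[list[float], str]] = []

    @staticmethod
    def _key(query: str) -> str:
        # Normalization is an assumption; the source only specifies SHA-256.
        return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

    def put(self, query: str, embedding: list[float], answer: str) -> None:
        self._by_hash[self._key(query)] = answer
        self._entries.append((embedding, answer))

    def lookup(self, query: str, embedding: list[float]):
        """Return (answer, tier) on a hit, or None on a miss."""
        exact = self._by_hash.get(self._key(query))
        if exact is not None:
            return exact, "exact"
        best_answer, best_score = None, 0.0
        for stored, answer in self._entries:  # stand-in for an HNSW search
            score = cosine(stored, embedding)
            if score > best_score:
                best_answer, best_score = answer, score
        if best_answer is not None and best_score >= SIMILARITY_THRESHOLD:
            return best_answer, "semantic"
        return None


if __name__ == "__main__":
    cache = TwoTierCache()
    cache.put("opening hours", [1.0, 0.0], "cached answer")
    print(cache.lookup("Opening Hours ", [1.0, 0.0]))
    print(cache.lookup("when are you open", [0.99, 0.05]))
    print(cache.lookup("unrelated", [0.0, 1.0]))
```

Checking the hash first means the embedding comparison only runs for queries that are not byte-identical after reformulation, which matches the ordering of the ~1 ms and ~30 ms tiers.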

Performance Targets

| Metric | Target | Current measured | Source / status |
| --- | --- | --- | --- |
| P50 latency | < 10 s (@nielsen1993responsetimes) | 7.8 s | thesis Chapter 4, Table 4.3; Met |
| P90 latency | < 15 s | 12.2 s | Same source; Met |
| P99 latency | not yet targeted | 20.9 s | Same source; Reported, no SLO yet |
| Cache hit rate | > 30% | ~40% | Internal telemetry; Exceeded |
| Quality gate pass rate | > 85% | ~89% | Internal telemetry; Met |
| Pass rate (302-q v3.6) | ≥ 95% | 99.0% (296/299) | thesis Chapter 4, Table 4.1; Exceeded |
| Monthly LLM cost | < $20 | ~$8.70 | Internal cost-tracking; Exceeded |

Benchmarking Methodology

Performance measurements follow standard practices for latency benchmarking:

  • Measurement point: End-to-end from WebSocket message receipt to final streaming chunk delivery
  • Statistical reporting: P50 (median), P95, and P99 latencies to capture tail behavior
  • Warm cache exclusion: Cache hit responses are excluded from latency measurements to reflect true pipeline performance
  • Repeat measurements: Each benchmark reports the mean of 10 consecutive queries with the same input to account for LLM response variability
  • Environment: Measurements taken on the development environment (local Ollama, OpenRouter for cloud LLM models)
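The reporting steps above can be sketched as follows. The nearest-rank percentile definition is one common choice and an assumption here, since the source does not say which estimator is used; the `Sample` type and function names are likewise hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    latency_ms: float
    cache_hit: bool


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarise(samples: list[Sample]) -> dict[str, float]:
    """Report P50/P95/P99 over pipeline runs only, excluding cache hits."""
    latencies = [s.latency_ms for s in samples if not s.cache_hit]
    return {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}


def repeated_mean(latencies_ms: list[float]) -> float:
    """Mean over repeated runs of the same input (10 runs in the text above)."""
    return sum(latencies_ms) / len(latencies_ms)


if __name__ == "__main__":
    runs = [Sample(float(ms), False) for ms in range(1, 101)]
    runs += [Sample(1.0, True)] * 50  # cache hits are ignored by summarise()
    print(summarise(runs))
```

Filtering out cache hits before computing percentiles keeps the fast ~1 ms responses from pulling the median below what the full pipeline delivers.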