Performance and Metrics

Performance in a RAG system is multi-dimensional (@lewis2020rag, @gao2024ragsurvey). Latency, cost, and quality form a three-way trade-off where optimizing one often degrades another. The ZOL Intelligent Search navigates this trade-off through careful model selection, caching, and the hybrid evaluation strategy. Tail-latency measurement and SLO framing follow @beyer2016sre; the underlying response-time UX targets follow @nielsen1993responsetimes (0.1 s for instantaneous feedback, 1 s for seamless flow, 10 s as the upper attention bound).

Source-of-truth for measured numbers

The waterfall below describes the design-time stage budget from the ADR-0034 optimization sprint (February 2026). The current measured end-to-end distribution across the 302-question golden set v3.6 is median 7,829 ms, P90 12,182 ms, P99 20,925 ms (thesis Chapter 4, Table 4.3). The ~5,500 ms total in the waterfall is consistent with the lower part of the measured distribution; the upper tail reflects long-form generation and follow-up-chain queries.

Response Time Breakdown

The design-time budget puts the end-to-end query pipeline at approximately 5.5 seconds (budgeted P50; the measured median is 7.8 s, as noted above). The following waterfall shows where the budgeted time is spent:

Time Budget Allocation

Stage	Time	% of Total	Optimization
Cache check	5ms	< 0.1%	PostgreSQL semantic cache (SHA-256 hash, Tier 1)
Intent classification + query rewriting	400ms	7%	Tier 2 (combined in one LLM call)
Metadata filtering	5ms	< 0.1%	In-memory lookup
Vector search	600ms	11%	HNSW index
BM25 search	600ms	(parallel)	PostgreSQL tsvector
Taxonomy search	600ms	(parallel)	SQL query optimization
Metadata boosting	5ms	< 0.1%	In-memory computation
Context assembly	50ms	1%	Expand ±1, dedup, budget
Response generation	3,500ms	64%	Tier 2 / Tier 3 full mode
Fast quality gate	600ms	11%	Embedding similarity
Streaming overhead	85ms	1%	WebSocket
Total	~5,500ms	100%

Key Insight

Response generation dominates the pipeline at 64% of total time. This is by design: the generation model (Tier 2, or Tier 3 in full mode) is the highest-quality model in the stack, and the streaming interface masks perceived latency. Users see the response appear word by word after ~4.5 seconds instead of waiting the full 5.5 seconds for a complete response. Intent classification and query rewriting are combined into a single LLM call (~400ms), which saves ~300ms compared to separate calls.
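
The 64% figure can be reproduced from the table. The sketch below is a minimal illustration, not project code: the dictionary keys and function names are hypothetical, and the stage values are copied from the Time Budget Allocation table. The three retrieval searches run in parallel, so only the slowest one counts toward the critical path. The itemized stages add up to 5,250 ms, slightly below the rounded ~5,500 ms total; the table does not itemize the difference.

```python
# Stage budgets in milliseconds, copied from the Time Budget Allocation table.
# Key names are illustrative only.
SEQUENTIAL_MS = {
    "cache_check": 5,
    "intent_and_rewrite": 400,
    "metadata_filtering": 5,
    "metadata_boosting": 5,
    "context_assembly": 50,
    "response_generation": 3500,
    "fast_quality_gate": 600,
    "streaming_overhead": 85,
}

# Vector, BM25, and taxonomy search run in parallel.
PARALLEL_SEARCH_MS = {"vector": 600, "bm25": 600, "taxonomy": 600}

STATED_TOTAL_MS = 5500  # rounded total from the table


def critical_path_ms(sequential, parallel):
    """Sum the sequential stages and add the slowest member of the parallel group."""
    return sum(sequential.values()) + max(parallel.values())


def share_of_total(stage_ms, total_ms=STATED_TOTAL_MS):
    """Return a stage's share of the stated total, as a percentage."""
    return 100.0 * stage_ms / total_ms


itemized = critical_path_ms(SEQUENTIAL_MS, PARALLEL_SEARCH_MS)
generation_share = share_of_total(SEQUENTIAL_MS["response_generation"])

print(f"itemized critical path: {itemized} ms")
print(f"generation share of stated total: {generation_share:.0f}%")
```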

LLM Cost Optimization

The multi-model strategy minimizes cost by using the right model for each task:

Task	Model	Cost per 1M tokens	Monthly Volume	Monthly Cost
Intent classification + query rewriting	Tier 2	~$0.40/1M in	~25K queries	~$0.50
Response generation (standard)	Tier 2	~$0.40/1M in	~15K queries	~$3.00
Response generation (full mode)	Tier 3	~$2.00/1M in	~15K queries	~$7.50
Background evaluation	Tier 2	~$0.40/1M in	~15K queries	~$0.30
Embeddings	OpenAI `text-embedding-3-large` (1536d, hosted, ADR-0048)	$0.13 / 1M tokens	All queries (~50 tokens/query × 25 K/mo ≈ 1.25 M tokens/mo)	≈$0.16/month
Entity extraction	Regex	$0 (local)	All documents	$0
Graph entity validation + page summaries	Tier 2	~$0.40/1M in	~2,000 pages/run	~$1-2/run*
Total				~$8.70/month

*Graph validation is an ingestion-time cost, not per-query. It runs once per full corpus extraction (~2,000 pages), producing validated entities for the taxonomy and page summaries for contextual retrieval. A cross-page entity cache reduces LLM calls by 10-25%. See ADR-0014.
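
The embedding row is the only one whose inputs are fully stated in the table, so it makes a convenient worked example of the cost arithmetic. The following sketch uses the table's approximate figures (about 50 tokens per query, about 25K queries per month, $0.13 per 1M tokens); the function and constant names are hypothetical.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.13  # from the Embeddings row
TOKENS_PER_QUERY = 50                # approximate, from the Embeddings row
QUERIES_PER_MONTH = 25_000           # approximate, from the Embeddings row


def monthly_embedding_cost(tokens_per_query, queries_per_month, price_per_million):
    """Return (tokens per month, cost in USD per month) for query embeddings."""
    tokens = tokens_per_query * queries_per_month
    cost = tokens / 1_000_000 * price_per_million
    return tokens, cost


tokens, cost = monthly_embedding_cost(
    TOKENS_PER_QUERY, QUERIES_PER_MONTH, PRICE_PER_MILLION_TOKENS_USD
)
print(f"{tokens:,} tokens/month, about ${cost:.2f}/month")
```

With these inputs the query volume works out to 1.25 M tokens per month, so embeddings are a negligible share of the monthly total.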

Caching Impact

The ~$8.70 estimate assumes a 40% cache hit rate, which avoids all LLM calls for repeated queries. Without caching, costs would be approximately $14/month. The PostgreSQL-based semantic query cache (ADR-0031, 1-hour TTL) provides a significant return on its minimal infrastructure cost through two-tier lookup (SHA-256 hash + embedding similarity).

The Hybrid Evaluation Breakthrough

The most significant performance decision was the shift from full inline evaluation to hybrid evaluation, which pairs a fast inline gate with background analytics.

This architectural change reduced perceived latency by 88-91% while maintaining quality assurance through the fast embedding-similarity gate and providing comprehensive analytics via the asynchronous background evaluation.
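
The control flow behind this split can be sketched with `asyncio`. This is an illustration under stated assumptions, not the system's implementation: the function names, the gate threshold of 0.75, and the simulated delay are all hypothetical, and the similarity score is passed in directly rather than computed from embeddings.

```python
import asyncio

FAST_GATE_THRESHOLD = 0.75  # hypothetical; the source does not state the threshold


async def fast_gate(similarity: float) -> bool:
    """Inline check: a cheap embedding-similarity comparison on the request path."""
    return similarity >= FAST_GATE_THRESHOLD


async def background_evaluation(answer: str, results: list) -> None:
    """Stand-in for the slow, comprehensive evaluation that runs off the request path."""
    await asyncio.sleep(0.05)  # simulates a slow LLM-based evaluation
    results.append({"answer": answer, "evaluated": True})


async def respond(answer: str, similarity: float, results: list):
    """Run the fast gate inline, then schedule analytics without awaiting them."""
    passed = await fast_gate(similarity)
    task = asyncio.create_task(background_evaluation(answer, results))
    return passed, task


async def main():
    results = []
    passed, task = await respond("example answer", 0.9, results)
    # The response is available while the background evaluation is still pending.
    pending_at_return = not task.done()
    await task  # awaited here only so the demo can inspect the result
    return passed, pending_at_return, results


passed, pending_at_return, results = asyncio.run(main())
print(passed, pending_at_return, len(results))
```

The user-facing path waits only for the fast gate; the detailed scores arrive later and feed analytics rather than the response itself.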

Prometheus Metrics

The system exports metrics to Prometheus for monitoring and alerting:

Request Metrics

Metric	Type	Description
`query_requests_total`	Counter	Total query count by intent type
`query_latency_seconds`	Histogram	End-to-end query latency (P50, P95, P99)
`cache_hit_ratio`	Gauge	Percentage of queries served from cache
`retrieval_results_count`	Histogram	Number of results returned per query
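
A Prometheus histogram stores cumulative bucket counts rather than raw samples, so percentiles such as P50, P95, and P99 are estimated from the buckets. The sketch below shows one way to do that with linear interpolation inside the matching bucket, in the spirit of PromQL's `histogram_quantile`. The bucket boundaries and counts are hypothetical example data, not values from this system.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by upper
    bound and ending with (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative buckets for query_latency_seconds.
EXAMPLE_BUCKETS = [(2.5, 5), (5.0, 20), (10.0, 70), (15.0, 95), (30.0, 100), (math.inf, 100)]

p50 = histogram_quantile(0.50, EXAMPLE_BUCKETS)
p95 = histogram_quantile(0.95, EXAMPLE_BUCKETS)
p99 = histogram_quantile(0.99, EXAMPLE_BUCKETS)
print(p50, p95, p99)
```

Because the estimate depends on bucket boundaries, boundaries near the latency targets (for example 10 s and 15 s) keep the interpolation error small where it matters most.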

Quality Metrics

Metric	Type	Description
`quality_gate_score`	Histogram	Fast gate similarity scores
`quality_gate_pass_ratio`	Gauge	Percentage of responses passing the gate
`deepeval_faithfulness`	Histogram	Background faithfulness scores
`deepeval_relevancy`	Histogram	Background relevancy scores
`feedback_positive_ratio`	Gauge	User thumbs-up percentage

Safety Metrics

Metric	Type	Description
`safety_blocks_total`	Counter	Intent-based safety blocks
`safety_validation_blocks`	Counter	Post-generation safety blocks
`medical_advice_incidents`	Counter	Medical-advice incidents; target: permanently zero

Caching Strategy

The semantic query cache lives in PostgreSQL with pgvector (ADR-0031). It provides a two-tier lookup over LLM-reformulated queries, which maximizes hit rates across languages:

Cache Level	Mechanism	Latency	Hit Rate
Tier 1: Exact hash	SHA-256 of reformulated query	~1ms	~30%
Tier 2: Semantic similarity	pgvector HNSW cosine similarity >= 0.97	~30ms	~10-15% additional
Session context (Redis)	In-memory conversation history	< 1ms	Per-user

The semantic cache was migrated from Redis to PostgreSQL (ADR-0031) because Tier 2 approximate nearest neighbor search requires pgvector's HNSW index. Redis continues to serve ephemeral state (sessions, rate limits, token blacklist). See Storage Architecture for schema details.
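
The two-tier lookup order can be sketched in plain Python. This is an in-memory stand-in under clear assumptions: a dictionary replaces the PostgreSQL table, a linear scan replaces the pgvector HNSW index, the 1-hour TTL is omitted, and the class and method names are hypothetical. Only the SHA-256 key and the 0.97 cosine threshold come from the table above.

```python
import hashlib
import math

SIMILARITY_THRESHOLD = 0.97  # from the Tier 2 row


def sha256_key(query: str) -> str:
    """Tier 1 key: SHA-256 hex digest of the reformulated query."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    """In-memory illustration; a linear scan stands in for the HNSW index."""

    def __init__(self):
        self._exact = {}     # sha256 hex digest -> cached response
        self._semantic = []  # (embedding, cached response) pairs

    def put(self, query, embedding, response):
        self._exact[sha256_key(query)] = response
        self._semantic.append((embedding, response))

    def get(self, query, embedding):
        # Tier 1: exact match on the hash of the reformulated query.
        hit = self._exact.get(sha256_key(query))
        if hit is not None:
            return "tier1", hit
        # Tier 2: nearest cached embedding, accepted only above the threshold.
        best_score, best_response = 0.0, None
        for cached_embedding, response in self._semantic:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= SIMILARITY_THRESHOLD:
            return "tier2", best_response
        return "miss", None


cache = TwoTierCache()
cache.put("visiting hours", [1.0, 0.0], "cached answer")
exact = cache.get("visiting hours", [1.0, 0.0])
near = cache.get("hours for visitors", [0.99, 0.05])
miss = cache.get("parking fees", [0.0, 1.0])
print(exact[0], near[0], miss[0])
```

Checking the hash first keeps the common case cheap; the embedding comparison runs only when the exact lookup fails.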

Performance Targets

Metric	Target	Current measured	Source / status
P50 latency	< 10 s (@nielsen1993responsetimes)	7.8 s	thesis Chapter 4, Table 4.3 — Met
P90 latency	< 15 s	12.2 s	Same source — Met
P99 latency	not yet targeted	20.9 s	Same source — Reported, no SLO yet
Cache hit rate	> 30 %	~40 %	Internal telemetry — Exceeded
Quality gate pass rate	> 85 %	~89 %	Internal telemetry — Met
Pass rate (302-q v3.6)	≥ 95 %	99.0 % (296/299)	thesis Chapter 4, Table 4.1 — Exceeded
Monthly LLM cost	< $20	~$8.70	Internal cost-tracking — Exceeded

Benchmarking Methodology

Performance measurements follow standard practices for latency benchmarking:

Measurement point: End-to-end from WebSocket message receipt to final streaming chunk delivery
Statistical reporting: P50 (median), P95, and P99 latencies to capture tail behavior
Warm cache exclusion: Cache hit responses are excluded from latency measurements to reflect true pipeline performance
Repeat measurements: Each benchmark reports the mean of 10 consecutive queries with the same input to account for LLM response variability
Environment: Measurements taken on the development environment (local Ollama, OpenRouter for cloud-hosted LLMs)
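
The reporting step can be illustrated with a short sketch. It assumes a nearest-rank percentile definition, which the source does not specify, and uses hypothetical sample data and function names. Cache hits are filtered out before the percentiles are computed, as described above; the averaging over 10 repeated queries is left out for brevity.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples to summarize")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(measurements):
    """measurements: list of (latency_ms, cache_hit) tuples; cache hits are excluded."""
    latencies = [ms for ms, cache_hit in measurements if not cache_hit]
    return {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}


# Hypothetical data: 20 full-pipeline runs plus 5 cache hits that must be ignored.
measurements = [(ms, False) for ms in range(1000, 21000, 1000)]
measurements += [(5, True)] * 5

summary = summarize(measurements)
print(summary)
```

Leaving the cache hits in would pull every percentile down sharply, which is why they are excluded when the goal is to characterize the full pipeline.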
