Performance and Metrics
Performance in a RAG system is multi-dimensional (@lewis2020rag, @gao2024ragsurvey). Latency, cost, and quality form a three-way trade-off where optimizing one often degrades another. The ZOL Intelligent Search navigates this trade-off through careful model selection, caching, and the hybrid evaluation strategy. Tail-latency measurement and SLO framing follow @beyer2016sre; the underlying response-time UX targets follow @nielsen1993responsetimes (0.1 s for instantaneous feedback, 1 s for seamless flow, 10 s as the upper attention bound).
The waterfall below describes the design-time stage budget from the ADR-0034 optimization sprint (February 2026). The current measured end-to-end distribution across the 302-question golden set v3.6 is median 7,829 ms, P90 12,182 ms, P99 20,925 ms (thesis Chapter 4, Table 4.3). The ~5,500 ms total in the waterfall is consistent with the lower part of the measured distribution; the upper tail reflects long-form generation and follow-up-chain queries.
Response Time Breakdown
The design-time budget for the end-to-end query pipeline totals approximately 5.5 seconds, which sits below the measured median reported above. The following waterfall shows where that time is spent:
Time Budget Allocation
| Stage | Time | % of Total | Optimization |
|---|---|---|---|
| Cache check | 5ms | < 0.1% | PostgreSQL semantic cache (SHA-256 exact hash, cache Tier 1) |
| Intent classification + query rewriting | 400ms | 7% | Tier 2 (combined in one LLM call) |
| Metadata filtering | 5ms | < 0.1% | In-memory lookup |
| Vector search | 600ms | 11% | HNSW index |
| BM25 search | 600ms | (parallel) | PostgreSQL tsvector |
| Taxonomy search | 600ms | (parallel) | SQL query optimization |
| Metadata boosting | 5ms | < 0.1% | In-memory computation |
| Context assembly | 50ms | 1% | Expand ±1, dedup, budget |
| Response generation | 3,500ms | 64% | Tier 2 / Tier 3 full mode |
| Fast quality gate | 600ms | 11% | Embedding similarity |
| Streaming overhead | 85ms | 1% | WebSocket |
| Total | ~5,500ms | 100% | |
Response generation dominates the pipeline at 64% of total time. This is by design: the generation model (Tier 2, or Tier 3 in full mode) is the highest-quality model in the stack, and the streaming interface masks perceived latency. Users see the response appear word by word after ~4.5 seconds rather than waiting the full 5.5 seconds for a complete response. Intent classification and query rewriting are combined into a single LLM call (~400ms), saving ~300ms compared to separate calls.
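The budget arithmetic can be checked with a short sketch. The stage values are copied from the table; the grouping into "before", "parallel", and "after" retrieval and all identifier names are illustrative assumptions. The three retrievers are marked as parallel in the table, so only the slowest one counts toward the critical path. Note that the itemized stages sum to 5,250 ms, while the table states a rounded total of ~5,500 ms and computes its percentages against that figure; the table does not itemize the difference.

```python
# Design-time stage budget in milliseconds, copied from the table above.
# The grouping and the names are illustrative, not taken from the codebase.
BEFORE_RETRIEVAL = {
    "cache_check": 5,
    "intent_and_rewrite": 400,
    "metadata_filtering": 5,
}
# The three retrievers run concurrently, so only the slowest one counts.
PARALLEL_RETRIEVAL = {"vector": 600, "bm25": 600, "taxonomy": 600}
AFTER_RETRIEVAL = {
    "metadata_boosting": 5,
    "context_assembly": 50,
    "response_generation": 3500,
    "fast_quality_gate": 600,
    "streaming_overhead": 85,
}


def critical_path_ms() -> int:
    """Sum the sequential stages plus the slowest parallel retriever."""
    return (
        sum(BEFORE_RETRIEVAL.values())
        + max(PARALLEL_RETRIEVAL.values())
        + sum(AFTER_RETRIEVAL.values())
    )


def share_of_budget(stage_ms: int, budget_ms: int = 5500) -> float:
    """Percentage of the stated ~5,500 ms budget consumed by one stage."""
    return round(100 * stage_ms / budget_ms, 1)


print(critical_path_ms())     # itemized critical path
print(share_of_budget(3500))  # response generation share
```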
LLM Cost Optimization
The multi-model strategy minimizes cost by using the right model for each task:
| Task | Model | Cost per 1M tokens | Monthly Volume | Monthly Cost |
|---|---|---|---|---|
| Intent classification + query rewriting | Tier 2 | ~$0.40/1M in | ~25K queries | ~$0.50 |
| Response generation (standard) | Tier 2 | ~$0.40/1M in | ~15K queries | ~$3.00 |
| Response generation (full mode) | Tier 3 | ~$2.00/1M in | ~15K queries | ~$7.50 |
| Background evaluation | Tier 2 | ~$0.40/1M in | ~15K queries | ~$0.30 |
| Embeddings | OpenAI text-embedding-3-large (1536d, hosted, ADR-0048) | $0.13 / 1M tokens | All queries (~50 tokens/query × 25 K/mo ≈ 1.25 M tokens/month) | ≈$0.16/month |
| Entity extraction | Regex | $0 (local) | All documents | $0 |
| Graph entity validation + page summaries | Tier 2 | ~$0.40/1M in | ~2,000 pages/run | ~$1-2/run* |
| Total | | | | ~$8.70/month |
*Graph validation is an ingestion-time cost, not per-query. It runs once per full corpus extraction (~2,000 pages), producing validated entities for the taxonomy and page summaries for contextual retrieval. A cross-page entity cache reduces LLM calls by 10-25%. See ADR-0014.
The ~$8.70 estimate assumes a 40% cache hit rate, which avoids all LLM calls for repeated queries. Without caching, costs would be approximately $14/month. The PostgreSQL-based semantic query cache (ADR-0031, 1-hour TTL) provides a significant return on its minimal infrastructure cost through two-tier lookup (SHA-256 hash + embedding similarity).
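The relationship between the cached and uncached estimates can be sketched as follows. This is a simplified model, assuming that LLM cost scales linearly with the cache miss rate (a hit skips every LLM call) and that embedding cost is tokens × price; the function names are hypothetical. Under these assumptions a 40% hit rate turns ~$14 into ~$8.40, close to but not identical to the ~$8.70 estimate, whose exact derivation is not itemized here.

```python
def embedding_cost_per_month(tokens_per_query: int,
                             queries_per_month: int,
                             usd_per_million_tokens: float) -> float:
    """Monthly embedding cost: total tokens times the per-million price."""
    tokens = tokens_per_query * queries_per_month
    return tokens / 1_000_000 * usd_per_million_tokens


def cost_with_cache(uncached_usd: float, hit_rate: float) -> float:
    """Assume a cache hit avoids all LLM calls, so cost follows the miss rate."""
    return uncached_usd * (1.0 - hit_rate)


# ~50 tokens/query, ~25K queries/month, $0.13 per 1M tokens (from the table)
embed = embedding_cost_per_month(50, 25_000, 0.13)
# ~$14/month without caching, 40% cache hit rate (from the paragraph above)
cached = cost_with_cache(14.0, 0.40)

print(f"embeddings: ${embed:.2f}/month")
print(f"LLM cost with cache: ${cached:.2f}/month")
```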
The Hybrid Evaluation Breakthrough
The most significant performance decision was the shift from full inline evaluation to hybrid evaluation (fast gate + background analytics).
This architectural change reduced perceived latency by 88-91% while maintaining quality assurance through the fast embedding-similarity gate and providing comprehensive analytics via the asynchronous background evaluation.
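A minimal sketch of the hybrid pattern, under several assumptions: the gate compares an answer embedding with a context embedding, the threshold value `GATE_THRESHOLD` is hypothetical (the actual cut-off is not stated here), and `background_evaluation` is a stand-in for the slower analytics run. The point it illustrates is that the gate decision is available immediately while the expensive evaluation is scheduled as a background task.

```python
import asyncio
import math

GATE_THRESHOLD = 0.75  # hypothetical value; the real threshold is not given here


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


async def background_evaluation(answer, results):
    """Stand-in for the slow evaluation that runs off the critical path."""
    await asyncio.sleep(0.05)  # placeholder for LLM-based scoring
    results.append({"answer": answer, "evaluated": True})


async def answer_with_hybrid_evaluation(answer, answer_vec, context_vec, results):
    """Run the fast gate inline and schedule the full evaluation in the background."""
    passed = cosine_similarity(answer_vec, context_vec) >= GATE_THRESHOLD
    task = asyncio.create_task(background_evaluation(answer, results))
    return passed, task


async def main():
    results = []
    passed, task = await answer_with_hybrid_evaluation(
        "example answer", [1.0, 0.0], [0.9, 0.1], results
    )
    evaluated_at_response_time = len(results)  # background work has not finished yet
    await task  # awaited here only so the demo can show the final state
    return passed, evaluated_at_response_time, len(results)


print(asyncio.run(main()))
```

In a real service the background task would typically not be awaited by the request handler; it is awaited in `main` only to keep the example deterministic.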
Prometheus Metrics
The system exports metrics to Prometheus for monitoring and alerting:
Request Metrics
| Metric | Type | Description |
|---|---|---|
| query_requests_total | Counter | Total query count by intent type |
| query_latency_seconds | Histogram | End-to-end query latency (P50, P95, P99) |
| cache_hit_ratio | Gauge | Percentage of queries served from cache |
| retrieval_results_count | Histogram | Number of results returned per query |
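A Prometheus histogram stores cumulative bucket counts rather than percentiles, so P50, P95, and P99 are estimated at query time. The sketch below shows the linear-interpolation idea behind that estimate in plain Python. The bucket bounds and counts are hypothetical, and the function is a simplification that ignores the `+Inf` bucket; it is not the monitoring stack's own code.

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    Buckets must be sorted by upper bound, with non-decreasing counts.
    """
    total = buckets[-1][1]
    if total == 0:
        raise ValueError("histogram has no observations")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if count == lower_count:
                return upper_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]


# Hypothetical latency buckets in seconds: (upper bound, cumulative count)
latency_buckets = [(2.5, 10), (5.0, 30), (10.0, 80), (20.0, 100)]

for q in (0.50, 0.95, 0.99):
    print(f"P{int(q * 100)} ~ {histogram_quantile(q, latency_buckets):.1f} s")
```

The estimate is only as precise as the bucket layout, so bucket bounds are usually chosen around the latency targets of interest.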
Quality Metrics
| Metric | Type | Description |
|---|---|---|
| quality_gate_score | Histogram | Fast gate similarity scores |
| quality_gate_pass_ratio | Gauge | Percentage of responses passing the gate |
| deepeval_faithfulness | Histogram | Background faithfulness scores |
| deepeval_relevancy | Histogram | Background relevancy scores |
| feedback_positive_ratio | Gauge | User thumbs-up percentage |
Safety Metrics
| Metric | Type | Description |
|---|---|---|
| safety_blocks_total | Counter | Intent-based safety blocks |
| safety_validation_blocks | Counter | Post-generation safety blocks |
| medical_advice_incidents | Counter | Target: permanently at ZERO |
Caching Strategy
The semantic query cache lives in PostgreSQL and uses pgvector (ADR-0031). It provides a two-tier lookup keyed on the LLM-reformulated query rather than the raw user input, which raises hit rates across languages:
| Cache Level | Mechanism | Latency | Hit Rate |
|---|---|---|---|
| Tier 1: Exact hash | SHA-256 of reformulated query | ~1ms | ~30% |
| Tier 2: Semantic similarity | pgvector HNSW cosine similarity >= 0.97 | ~30ms | ~10-15% additional |
| Session context (Redis) | In-memory conversation history | < 1ms | Per-user |
The semantic cache was migrated from Redis to PostgreSQL (ADR-0031) because Tier 2 approximate nearest neighbor search requires pgvector's HNSW index. Redis continues to serve ephemeral state (sessions, rate limits, token blacklist). See Storage Architecture for schema details.
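The two-tier lookup order can be sketched in memory. This is an illustration only: a dictionary and a linear scan stand in for PostgreSQL and the pgvector HNSW index, the 1-hour TTL is omitted, and the class and method names are hypothetical. The input is assumed to be the already reformulated query and its embedding.

```python
import hashlib
import math

SIMILARITY_THRESHOLD = 0.97  # Tier 2 cut-off from the table above


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    """In-memory stand-in for the PostgreSQL-backed semantic cache."""

    def __init__(self):
        self._by_hash = {}   # Tier 1: SHA-256 of the reformulated query
        self._entries = []   # Tier 2: (embedding, answer) pairs

    @staticmethod
    def _key(query: str) -> str:
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def put(self, query, embedding, answer):
        self._by_hash[self._key(query)] = answer
        self._entries.append((embedding, answer))

    def get(self, query, embedding):
        """Return (answer, tier); tier is 'tier1', 'tier2', or 'miss'."""
        exact = self._by_hash.get(self._key(query))
        if exact is not None:
            return exact, "tier1"
        best_answer, best_score = None, 0.0
        for vec, answer in self._entries:  # linear scan instead of HNSW
            score = cosine(vec, embedding)
            if score > best_score:
                best_answer, best_score = answer, score
        if best_score >= SIMILARITY_THRESHOLD:
            return best_answer, "tier2"
        return None, "miss"


cache = TwoTierCache()
cache.put("visiting hours", [1.0, 0.0], "cached answer")
print(cache.get("visiting hours", [1.0, 0.0]))          # exact hash match
print(cache.get("when can I visit", [0.99, 0.05]))      # near-duplicate embedding
print(cache.get("parking fees", [0.0, 1.0]))            # unrelated query
```

Checking the hash first keeps the common repeated-query case cheap; the similarity search only runs when the exact lookup misses.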
Performance Targets
| Metric | Target | Current measured | Source / status |
|---|---|---|---|
| P50 latency | < 10 s (@nielsen1993responsetimes) | 7.8 s | thesis Chapter 4, Table 4.3 — Met |
| P90 latency | < 15 s | 12.2 s | Same source — Met |
| P99 latency | not yet targeted | 20.9 s | Same source — Reported, no SLO yet |
| Cache hit rate | > 30 % | ~40 % | Internal telemetry — Exceeded |
| Quality gate pass rate | > 85 % | ~89 % | Internal telemetry — Met |
| Pass rate (302-q v3.6) | ≥ 95 % | 99.0 % (296/299) | thesis Chapter 4, Table 4.1 — Exceeded |
| Monthly LLM cost | < $20 | ~$8.70 | Internal cost-tracking — Exceeded |
Benchmarking Methodology
Performance measurements follow standard practices for latency benchmarking:
- Measurement point: End-to-end from WebSocket message receipt to final streaming chunk delivery
- Statistical reporting: P50 (median), P95, and P99 latencies to capture tail behavior
- Warm cache exclusion: Cache hit responses are excluded from latency measurements to reflect true pipeline performance
- Repeat measurements: Each benchmark reports the mean of 10 consecutive queries with the same input to account for LLM response variability
- Environment: Measurements taken on the development environment (local Ollama, OpenRouter for cloud-hosted LLMs)
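The reporting steps above can be sketched as a small helper. It assumes each sample is a `(latency_ms, cache_hit)` pair and uses the nearest-rank percentile definition; both the sample format and the choice of percentile method are assumptions, since the methodology does not specify them.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    """Report P50/P95/P99 over cache-miss samples only.

    samples: iterable of (latency_ms, cache_hit) pairs.
    """
    latencies = [ms for ms, cache_hit in samples if not cache_hit]
    return {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}


# Hypothetical run: ten full-pipeline queries plus one cache hit that is excluded.
samples = [(ms, False) for ms in range(1000, 11000, 1000)] + [(5, True)]
print(summarize(samples))
```

With only a handful of samples the upper percentiles collapse onto the slowest observation, which is one reason tail figures are more meaningful on the full golden set than on a single repeated query.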