ADR-0031: Two-Tier Semantic Query Cache
Date: 2026-02-14 | Status: Accepted | Relates to: ADR-0030 (LLM Entity Extraction)
The body below references the BGE-M3 embedder (1024 dim) that was in production at the time the cache was designed. The cache architecture itself is unchanged; only the embedding model has moved. Production now uses OpenAI text-embedding-3-large (1536 dim) per ADR-0048, so the schema column is vector(1536) NOT NULL in the live database, and any reference to "1024 dim BGE-M3" in this page should be read as "the embedder of the day at the time of the design". Cosine-similarity behaviour, the 0.97 threshold, and the two-tier lookup flow are all preserved.
Context
The ZOL RAG pipeline (Lewis et al., 2020) processes every query through an expensive chain: intent classification (~200ms) → embedding + retrieval (~300ms) → LLM generation (~2-5s) → safety validation (~500ms). Total: 3-6 seconds per query.
ADR-0030 introduced LLM-first query normalization. The intent classifier now reformulates every query into well-formed Dutch clinical language, collapsing many distinct user queries (across languages) into identical or near-identical strings — a natural cache normalization layer. This cache is built directly on top of that step: it keys on the rewritten query, never the raw input. For how rewriting works (canonical templates, follow-up resolution), see the Query Rewriting page; this ADR covers only how the cache consumes its output.
| Original (multilingual) | Reformulated (Dutch clinical) |
|---|---|
| "rugpijn" | "Welke afdelingen bij ZOL behandelen rugpijn?" |
| "back pain" | "Welke afdelingen bij ZOL behandelen rugpijn?" |
| "sırt ağrısı" (Turkish) | "Welke afdelingen bij ZOL behandelen rugpijn?" |
Decision
Implement a two-tier semantic query cache stored in PostgreSQL (using the existing pgvector infrastructure), operating on the reformulated query rather than raw user input.
Tier 1: Exact Hash (~1ms)
SHA-256 hash of reformulated_query + "|lang=" + detected_language. Since the LLM normalizes language, spelling, and phrasing, many different original queries produce the same reformulated query. Including the detected language in the hash ensures that responses generated in different languages are cached separately. This catches the majority of cache-eligible queries, including cross-language duplicates that share the same target language.
Tier 2: Embedding Similarity (~30ms)
For Tier 1 misses, embed the reformulated query using BGE-M3 (1024 dimensions) and perform a pgvector cosine similarity search against cached embeddings. If the nearest cached entry has similarity ≥ threshold (default: 0.97), it is a cache hit.
This catches near-miss phrasings that the LLM reformulated differently but that mean the same thing:
- "Welke artsen werken op orthopedie?" vs "Welke dokters zijn er op de afdeling orthopedie?"
- Cosine similarity ≈ 0.974 → cache hit
Why PostgreSQL Instead of Redis?
| Criterion | PostgreSQL + pgvector | Redis |
|---|---|---|
| Vector similarity search | Native HNSW support | Requires Redis Stack (not deployed) |
| Persistence | Survives restarts | Volatile without AOF/RDB |
| Multi-tenant queries | SQL WHERE clauses | Manual key namespacing |
| Already deployed | Yes (vector store) | Yes (sessions only) |
The decisive factor is Tier 2: approximate nearest neighbor search on embeddings requires pgvector's HNSW index.
Cache Schema
CREATE TABLE app.semantic_query_cache (
id UUID PRIMARY KEY,
tenant_id UUID NOT NULL,
query_hash VARCHAR(64) NOT NULL, -- SHA-256 of reformulated_query + lang
reformulated_query TEXT NOT NULL,
detected_language VARCHAR(10), -- ISO language code (e.g. 'nl', 'en')
original_query TEXT NOT NULL,
query_embedding vector(1024) NOT NULL, -- BGE-M3
response_text TEXT NOT NULL,
citations JSONB,
intent VARCHAR(50),
hit_count INTEGER DEFAULT 0,
created_at TIMESTAMPTZ,
expires_at TIMESTAMPTZ NOT NULL,
CONSTRAINT uq_tenant_query_hash UNIQUE (tenant_id, query_hash)
);
Lookup Flow
Intent classification still runs on every request (needed to get the reformulated query). The cache saves the expensive retrieval + LLM generation + safety steps (~3-5s).
Admin Controls
Cache management is exposed in the admin Settings UI:
| Control | Description |
|---|---|
| Toggle | Enable/disable query cache (runtime, per-tenant) |
| Clear Cache | Wipe all cached entries for the tenant |
| Cache Stats | Hits, misses, hit rate, entry count |
| Similarity Threshold | Configurable slider (default 0.97, range 0.80-1.00) |
Configuration
| Setting | Default | Description |
|---|---|---|
query_cache_enabled | true | Enable/disable the entire cache |
query_cache_ttl_seconds | 3600 (1 hour) | Time-to-live for cache entries |
query_cache_similarity_threshold | 0.97 | Minimum cosine similarity for Tier 2 hits |
query_cache_max_size | 1000 | Maximum number of cached entries per tenant |
Consequences
Positive
- Latency reduction on cache hits: ~3-5s → ~1ms (Tier 1) or ~30ms (Tier 2)
- Cross-language caching: Turkish query hits Dutch-cached response for free (via LLM reformulation)
- Cost reduction: Each cache hit saves LLM generation, retrieval, and safety validation calls (~$0.003/query)
- Admin testability: Toggle, clear, view stats — enables meaningful A/B testing
Negative
- Intent classification still required: ~200ms on every request (needed for reformulation)
- Stale responses: Cached responses don't reflect graph/document changes until TTL expires. Mitigated by admin clear-cache button and 1-hour default TTL.
Risks
- Similarity threshold tuning: Too low → wrong cached response. Too high → rarely hits. Default 0.97 is tuned empirically; adjustable via admin UI.
- Cache pollution: A bad LLM response gets cached. Mitigated by quality gate running before caching and TTL expiration.
Alternatives Considered
- Redis-only exact hash (previous): Misses semantically identical queries. Hit rate < 5%.
- Redis Stack with RediSearch: Supports vectors but requires changing Docker image. Unnecessary given existing pgvector.
- Embedding of original query (not reformulated): Lower hit rate — multilingual/colloquial queries produce distant embeddings.
With 25,000 monthly queries and an estimated 40-60% cache hit rate, caching saves approximately $30-45/month in API costs, plus reducing latency from ~5.5s to < 50ms.