Skip to main content

ADR-0031: Two-Tier Semantic Query Cache

Date: 2026-02-14 | Status: Accepted | Relates to: ADR-0030 (LLM Entity Extraction)

Embedding-model context

The body below references the BGE-M3 embedder (1024 dim) that was in production at the time the cache was designed. The cache architecture itself is unchanged; only the embedding model has moved. Production now uses OpenAI text-embedding-3-large (1536 dim) per ADR-0048, so the schema column is vector(1536) NOT NULL in the live database, and any reference to "1024 dim BGE-M3" in this page should be read as "the embedder of the day at the time of the design". Cosine-similarity behaviour, the 0.97 threshold, and the two-tier lookup flow are all preserved.

Context

The ZOL RAG pipeline (Lewis et al., 2020) processes every query through an expensive chain: intent classification (~200ms) → embedding + retrieval (~300ms) → LLM generation (~2-5s) → safety validation (~500ms). Total: 3-6 seconds per query.

ADR-0030 introduced LLM-first query normalization. The intent classifier now reformulates every query into well-formed Dutch clinical language, collapsing many distinct user queries (across languages) into identical or near-identical strings — a natural cache normalization layer. This cache is built directly on top of that step: it keys on the rewritten query, never the raw input. For how rewriting works (canonical templates, follow-up resolution), see the Query Rewriting page; this ADR covers only how the cache consumes its output.

Original (multilingual)Reformulated (Dutch clinical)
"rugpijn""Welke afdelingen bij ZOL behandelen rugpijn?"
"back pain""Welke afdelingen bij ZOL behandelen rugpijn?"
"sırt ağrısı" (Turkish)"Welke afdelingen bij ZOL behandelen rugpijn?"

Decision

Implement a two-tier semantic query cache stored in PostgreSQL (using the existing pgvector infrastructure), operating on the reformulated query rather than raw user input.

Tier 1: Exact Hash (~1ms)

SHA-256 hash of reformulated_query + "|lang=" + detected_language. Since the LLM normalizes language, spelling, and phrasing, many different original queries produce the same reformulated query. Including the detected language in the hash ensures that responses generated in different languages are cached separately. This catches the majority of cache-eligible queries, including cross-language duplicates that share the same target language.

Tier 2: Embedding Similarity (~30ms)

For Tier 1 misses, embed the reformulated query using BGE-M3 (1024 dimensions) and perform a pgvector cosine similarity search against cached embeddings. If the nearest cached entry has similarity ≥ threshold (default: 0.97), it is a cache hit.

This catches near-miss phrasings that the LLM reformulated differently but that mean the same thing:

  • "Welke artsen werken op orthopedie?" vs "Welke dokters zijn er op de afdeling orthopedie?"
  • Cosine similarity ≈ 0.974 → cache hit

Why PostgreSQL Instead of Redis?

CriterionPostgreSQL + pgvectorRedis
Vector similarity searchNative HNSW supportRequires Redis Stack (not deployed)
PersistenceSurvives restartsVolatile without AOF/RDB
Multi-tenant queriesSQL WHERE clausesManual key namespacing
Already deployedYes (vector store)Yes (sessions only)

The decisive factor is Tier 2: approximate nearest neighbor search on embeddings requires pgvector's HNSW index.

Cache Schema

CREATE TABLE app.semantic_query_cache (
id UUID PRIMARY KEY,
tenant_id UUID NOT NULL,
query_hash VARCHAR(64) NOT NULL, -- SHA-256 of reformulated_query + lang
reformulated_query TEXT NOT NULL,
detected_language VARCHAR(10), -- ISO language code (e.g. 'nl', 'en')
original_query TEXT NOT NULL,
query_embedding vector(1024) NOT NULL, -- BGE-M3
response_text TEXT NOT NULL,
citations JSONB,
intent VARCHAR(50),
hit_count INTEGER DEFAULT 0,
created_at TIMESTAMPTZ,
expires_at TIMESTAMPTZ NOT NULL,
CONSTRAINT uq_tenant_query_hash UNIQUE (tenant_id, query_hash)
);

Lookup Flow

Important

Intent classification still runs on every request (needed to get the reformulated query). The cache saves the expensive retrieval + LLM generation + safety steps (~3-5s).

Admin Controls

Cache management is exposed in the admin Settings UI:

ControlDescription
ToggleEnable/disable query cache (runtime, per-tenant)
Clear CacheWipe all cached entries for the tenant
Cache StatsHits, misses, hit rate, entry count
Similarity ThresholdConfigurable slider (default 0.97, range 0.80-1.00)

Configuration

SettingDefaultDescription
query_cache_enabledtrueEnable/disable the entire cache
query_cache_ttl_seconds3600 (1 hour)Time-to-live for cache entries
query_cache_similarity_threshold0.97Minimum cosine similarity for Tier 2 hits
query_cache_max_size1000Maximum number of cached entries per tenant

Consequences

Positive

  • Latency reduction on cache hits: ~3-5s → ~1ms (Tier 1) or ~30ms (Tier 2)
  • Cross-language caching: Turkish query hits Dutch-cached response for free (via LLM reformulation)
  • Cost reduction: Each cache hit saves LLM generation, retrieval, and safety validation calls (~$0.003/query)
  • Admin testability: Toggle, clear, view stats — enables meaningful A/B testing

Negative

  • Intent classification still required: ~200ms on every request (needed for reformulation)
  • Stale responses: Cached responses don't reflect graph/document changes until TTL expires. Mitigated by admin clear-cache button and 1-hour default TTL.

Risks

  • Similarity threshold tuning: Too low → wrong cached response. Too high → rarely hits. Default 0.97 is tuned empirically; adjustable via admin UI.
  • Cache pollution: A bad LLM response gets cached. Mitigated by quality gate running before caching and TTL expiration.

Alternatives Considered

  1. Redis-only exact hash (previous): Misses semantically identical queries. Hit rate < 5%.
  2. Redis Stack with RediSearch: Supports vectors but requires changing Docker image. Unnecessary given existing pgvector.
  3. Embedding of original query (not reformulated): Lower hit rate — multilingual/colloquial queries produce distant embeddings.
Cost Optimization

With 25,000 monthly queries and an estimated 40-60% cache hit rate, caching saves approximately $30-45/month in API costs, plus reducing latency from ~5.5s to < 50ms.