
ADR-0054: Intent Classification Cache

Date: 2026-05-12 | Status: Accepted | Relates to: ADR-0031 (Semantic Query Cache), ADR-0030 (LLM Entity Extraction)

Context

Per-stage telemetry showed intent classification dominating chat-pipeline latency at p50 ≈ 2,300 ms per query. The cost comes from a single OpenAI gpt-4.1 call that runs on every turn before retrieval. The Semantic Query Cache already eliminates the full pipeline cost for exact-string answer repeats, but on a Tier-1 miss every step, including intent classification, is rerun from scratch.

Production traffic analysis (2026-05-11) showed a meaningful fraction of queries are repeat phrasings of common questions. The intent classifier runs at temperature=0.1, so its output is effectively deterministic: identical input almost always produces identical output. There is no reason to pay 2,300 ms for the same answer twice.

Decision

Add a separate cache layer keyed on (tenant_id, normalized_query, language) that stores the full IntentClassificationResult Pydantic model. The cache is deliberately decoupled from the semantic query cache because their failure modes differ:

| Cache | What it stores | Wrong cache hit looks like |
|---|---|---|
| Semantic query cache | Full LLM answer | Wrong content in the response |
| Intent classification cache | Routing decision (intent class + strategy) | Wrong route: retrieval gets bad top-K, prompt uses wrong template |
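
The cache key described above, (tenant_id, normalized_query, language), could be built roughly as follows. This is a minimal sketch, not the project's code: the function names and the normalization steps (NFKC, case-folding, whitespace collapsing) are assumptions, and the component order and `|` separator are inferred from the key shown under Verification.

```python
import re
import unicodedata

KEY_PREFIX = "intent_cache:"  # prefix named in this ADR


def normalize_query(query: str) -> str:
    """Hypothetical normalization: Unicode NFKC, case-fold, collapse whitespace."""
    text = unicodedata.normalize("NFKC", query).casefold()
    return re.sub(r"\s+", " ", text).strip()


def build_cache_key(tenant_id: str, query: str, language: str) -> str:
    """Join the key components; order and separator are assumptions."""
    return f"{KEY_PREFIX}{tenant_id}|{language}|{normalize_query(query)}"
```

Under these assumptions, two phrasings that differ only in casing or spacing map to the same key, while the same query from another tenant or in another language does not.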

Two backends share one async interface (IntentCacheBackend Protocol):

Memory backend (default)

Per-worker OrderedDict bounded LRU with TTL. Process restart clears the cache — a safety property that bounds poisoning to one container lifecycle. Used on the single-worker pilot.
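
A bounded LRU with TTL on top of OrderedDict might look like the sketch below. The Protocol method names (get, set, clear), the constructor parameters, and the injectable clock are assumptions made for illustration; only the class names IntentCacheBackend and MemoryIntentCache come from this ADR.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional, Protocol


class IntentCacheBackend(Protocol):
    """Assumed shape of the shared async interface."""

    async def get(self, key: str) -> Optional[Any]: ...
    async def set(self, key: str, value: Any) -> None: ...
    async def clear(self) -> int: ...


class MemoryIntentCache:
    """Per-process LRU cache with a fixed TTL per entry."""

    def __init__(self, max_size: int = 1000, ttl_seconds: float = 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: OrderedDict = OrderedDict()  # key -> (expires_at, value)

    async def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired entries are dropped lazily on read
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    async def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict least recently used

    async def clear(self) -> int:
        removed = len(self._entries)
        self._entries.clear()
        return removed
```

Because the dictionary lives in the worker process, nothing here survives a restart, which is the safety property the paragraph above relies on.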

Redis backend (opt-in via INTENT_CACHE_BACKEND=redis)

Shared across worker replicas via the existing app.db.redis connection pool. Values are JSON round-tripped through Pydantic. Keys are prefixed intent_cache: so a SCAN-based clear targets only this cache without disturbing rate-limiter or token-blacklist keys.

The Redis backend persists across container restarts — meaning the "restart fixes poisoning" remedy that works for the memory backend no longer applies. The compensating control bundled in the same PR (f9a335c4) is the operator "Clear Cache" button on PlatformSettingsPage, which wipes both the intent cache AND the semantic query cache in a single click (POST /api/v1/settings/cache/clear).
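
The Redis behaviour described in this section (JSON round-trip, prefixed keys, SCAN-based clear) can be sketched as below. It assumes `client` behaves like redis.asyncio.Redis (awaitable get, set with ex=, delete, and an async scan_iter) and that the model class exposes the Pydantic v2 methods model_dump_json and model_validate_json. The constructor signature and the point where the prefix is applied are illustrative, not taken from the implementation.

```python
from typing import Any, Optional

KEY_PREFIX = "intent_cache:"  # prefix named in this ADR


class RedisIntentCache:
    """Sketch of a Redis-backed cache over a redis.asyncio-style client."""

    def __init__(self, client: Any, model_cls: Any, ttl_seconds: int = 3600) -> None:
        self._client = client
        self._model_cls = model_cls  # e.g. IntentClassificationResult
        self._ttl = ttl_seconds

    async def get(self, key: str) -> Optional[Any]:
        raw = await self._client.get(KEY_PREFIX + key)
        if raw is None:
            return None
        # Rebuild the model from its JSON form.
        return self._model_cls.model_validate_json(raw)

    async def set(self, key: str, value: Any) -> None:
        # Redis expires the key itself after ttl_seconds.
        await self._client.set(KEY_PREFIX + key, value.model_dump_json(), ex=self._ttl)

    async def clear(self) -> int:
        removed = 0
        # SCAN with a prefix pattern leaves rate-limiter and token-blacklist keys alone.
        async for redis_key in self._client.scan_iter(match=KEY_PREFIX + "*"):
            removed += await self._client.delete(redis_key)
        return removed
```

A clear like this is what an endpoint such as POST /api/v1/settings/cache/clear would need to call for the intent cache half of the kill switch.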

Poisoning Guard

IntentClassificationService writes to the cache only when:

  • result.confidence >= INTENT_CACHE_CONFIDENCE_THRESHOLD (default 0.85)
  • result.intent != UserIntent.UNKNOWN

These guards live in the caller, so they apply regardless of backend choice.
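
The two write conditions above reduce to a small predicate. In this sketch the dataclass and enum are stand-ins for the real Pydantic model and UserIntent enum; the FAQ member and the should_cache name are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class UserIntent(str, Enum):
    UNKNOWN = "unknown"
    FAQ = "faq"  # hypothetical member, for illustration only


@dataclass
class IntentClassificationResult:
    """Stand-in for the real Pydantic model; only the guarded fields are shown."""
    intent: UserIntent
    confidence: float


INTENT_CACHE_CONFIDENCE_THRESHOLD = 0.85  # default from the Configuration table


def should_cache(result: IntentClassificationResult,
                 threshold: float = INTENT_CACHE_CONFIDENCE_THRESHOLD) -> bool:
    """Cache only confident, non-UNKNOWN classifications."""
    return result.confidence >= threshold and result.intent != UserIntent.UNKNOWN
```

Keeping the predicate in the caller means a low-confidence result is never handed to either backend, so neither backend needs its own copy of the rule.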

Resilience

Every Redis operation is wrapped in try/except and falls back to a cache miss on failure. A Redis outage degrades the system to "every query pays the 2,300 ms LLM cost" — not "the system crashes." If the Redis connection pool is uninitialised when get_intent_cache() is first called with INTENT_CACHE_BACKEND=redis, the singleton falls back to the memory backend with a startup-time warning.
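
The fail-open behaviour can be illustrated with a thin wrapper and a backend selector. This is a sketch under assumptions: the ADR states that the try/except lives around each Redis operation, whereas the wrapper below applies the same idea to a whole backend. FailOpenCache, select_backend, and the factory callables are hypothetical names.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("intent_cache")


class FailOpenCache:
    """Turns backend errors into cache misses and dropped writes."""

    def __init__(self, backend: Any) -> None:
        self._backend = backend

    async def get(self, key: str) -> Optional[Any]:
        try:
            return await self._backend.get(key)
        except Exception:
            logger.warning("intent cache get failed; treating as a miss", exc_info=True)
            return None

    async def set(self, key: str, value: Any) -> None:
        try:
            await self._backend.set(key, value)
        except Exception:
            logger.warning("intent cache set failed; skipping write", exc_info=True)


def select_backend(backend_name: str, redis_pool: Optional[Any],
                   make_memory: Callable[[], Any],
                   make_redis: Callable[[Any], Any]) -> Any:
    """Pick a backend; fall back to memory when the Redis pool is not ready."""
    if backend_name == "redis":
        if redis_pool is None:
            logger.warning("Redis pool uninitialised; falling back to memory backend")
            return make_memory()
        return make_redis(redis_pool)
    return make_memory()
```

With this shape, the worst case during a Redis outage is that every lookup returns a miss and the classifier call runs as it did before the cache existed.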

Configuration

| Env variable | Default | Range | Purpose |
|---|---|---|---|
| INTENT_CACHE_ENABLED | true | bool | Master switch |
| INTENT_CACHE_BACKEND | memory | memory / redis | Pick backend |
| INTENT_CACHE_MAX_SIZE | 1000 | 10–100000 | Memory backend LRU bound |
| INTENT_CACHE_TTL_SECONDS | 3600 | 60–604800 | Both backends |
| INTENT_CACHE_CONFIDENCE_THRESHOLD | 0.85 | 0.0–1.0 | Poisoning guard |
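
One way to load and range-check these variables is sketched below. The variable names, defaults, and ranges come from the table; the IntentCacheSettings and load_settings names, the accepted boolean spellings, and the choice to reject out-of-range values rather than clamp them are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class IntentCacheSettings:
    enabled: bool = True
    backend: str = "memory"
    max_size: int = 1000
    ttl_seconds: int = 3600
    confidence_threshold: float = 0.85


def _in_range(name: str, value, low, high):
    """Reject values outside the documented range (assumed policy)."""
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return value


def load_settings(env: Mapping[str, str] = os.environ) -> IntentCacheSettings:
    backend = env.get("INTENT_CACHE_BACKEND", "memory").strip().lower()
    if backend not in ("memory", "redis"):
        raise ValueError(f"INTENT_CACHE_BACKEND={backend!r} must be 'memory' or 'redis'")
    enabled_raw = env.get("INTENT_CACHE_ENABLED", "true").strip().lower()
    return IntentCacheSettings(
        enabled=enabled_raw in ("1", "true", "yes", "on"),
        backend=backend,
        max_size=_in_range("INTENT_CACHE_MAX_SIZE",
                           int(env.get("INTENT_CACHE_MAX_SIZE", "1000")), 10, 100_000),
        ttl_seconds=_in_range("INTENT_CACHE_TTL_SECONDS",
                              int(env.get("INTENT_CACHE_TTL_SECONDS", "3600")), 60, 604_800),
        confidence_threshold=_in_range(
            "INTENT_CACHE_CONFIDENCE_THRESHOLD",
            float(env.get("INTENT_CACHE_CONFIDENCE_THRESHOLD", "0.85")), 0.0, 1.0),
    )
```

Passing a plain dict instead of os.environ keeps the loader easy to test, for example load_settings({"INTENT_CACHE_BACKEND": "redis"}).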

Consequences

Positive

  • Cache hit removes ~2,300 ms from the per-turn latency budget. Stacks with the semantic_query_cache — if both hit, the full pipeline collapses to ~50 ms.
  • Backend selection is a runtime knob, not a code change. Single-worker pilot uses memory; multi-worker production flips to Redis without rebuilding the image.
  • Poisoning has a one-click remedy via the existing UI button — operators don't need shell access to recover from a cache-poisoning incident.

Negative

  • Two caches now share the same poisoning failure mode. In Redis mode, container restart no longer self-heals poisoning. This is the entire reason the UI kill switch was bundled in the same PR as the Redis backend, rather than deferred to a follow-up.
  • Marginal Redis pool pressure under heavy traffic — one additional GET per request that isn't already cached at the semantic layer. Pool defaults are sized for rate-limiting + token-blacklist + ingestion; redis_max_connections may need adjustment under future load profiles.
  • Observability is per-process today. stats() returns per-worker counters, so an aggregate hit rate requires combining them across workers. Future work would emit metrics to the existing pipeline_telemetry stream.
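
To make the aggregation point concrete, here is a small sketch that combines per-worker counters into one hit rate. The shape of the stats() output (a mapping with "hits" and "misses") and the function name are assumptions.

```python
from typing import Iterable, Mapping


def aggregate_hit_rate(worker_stats: Iterable[Mapping[str, int]]) -> float:
    """Sum raw counters across workers before dividing."""
    hits = misses = 0
    for stats in worker_stats:
        hits += stats["hits"]
        misses += stats["misses"]
    total = hits + misses
    return hits / total if total else 0.0
```

Summing counters first matters when workers see uneven traffic. For workers reporting 30 hits / 10 misses and 10 hits / 50 misses, the aggregate rate is 40 / 100 = 0.4, whereas averaging the two per-worker rates would give about 0.46.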

Alternatives Considered

| Alternative | Rejected because |
|---|---|
| Extend the semantic_query_cache to also cache intent results | Conflates two different failure modes and value types. The semantic cache key is a 1536-dim embedding of the reformulated query; the intent cache key is the raw user input. The semantic cache stores full answers; the intent cache stores classification objects. Reusing the table would have forced shared schema, eviction policy, and TTL semantics. |
| Cache intent results in PostgreSQL | Adds a database round-trip on every classification, the opposite of the goal. Redis hits in 1–2 ms locally; PG hits in 5–10 ms. |
| Embed the cache inside IntentClassificationService | Would couple the cache to the service and prevent the clean Protocol-based backend swap. The current factory + Protocol design keeps the service agnostic to backend choice. |

Verification

Live verification on pilot (zol-rag-app:f9a335c4, 2026-05-12):

# 1. Trigger a fresh intent classification
curl -X POST .../api/v1/query -d '{"query":"...","channel":"web"}'

# 2. Confirm Redis key written
redis-cli --scan --pattern "intent_cache:*"
# Returns: intent_cache:|nl|<normalized query>

Test coverage:

  • 12 unit tests on MemoryIntentCache (backend/tests/unit/services/test_intent_cache.py)
  • 13 integration tests on RedisIntentCache against a Redis 7 testcontainer (backend/tests/integration/test_redis_intent_cache.py)
  • 4 integration tests on the kill-switch endpoint (backend/tests/integration/api/test_settings_cache_clear.py)

Related

  • ADR-0031: Semantic Query Cache, the other cache layer; both share the same UI kill switch
  • backend/app/services/intent_cache.py: implementation
  • backend/app/api/settings.py: POST /api/v1/settings/cache/clear endpoint
  • System Overview: where this cache fits in the layered architecture