ADR-0054: Intent Classification Cache
Date: 2026-05-12 | Status: Accepted | Relates to: ADR-0031 (Semantic Query Cache), ADR-0030 (LLM Entity Extraction)
Context
Per-stage telemetry showed intent classification dominating chat-pipeline latency at p50 ≈ 2,300 ms on every query. The stage is a single OpenAI gpt-4.1 call that runs on every turn before retrieval. The Semantic Query Cache already eliminates the full pipeline cost for exact-string answer repeats, but on a Tier-1 miss every step (including intent classification) is rerun fresh.
Production traffic analysis (2026-05-11) showed a meaningful fraction of queries are repeat phrasings of common questions. The intent classifier runs at temperature=0.1, so its output is effectively deterministic: identical input produces the same classification in practice. There is no reason to pay 2,300 ms for the same answer twice.
Decision
A separate cache layer keyed on (tenant_id, normalized_query, language) storing the full IntentClassificationResult Pydantic model. The cache is deliberately decoupled from the semantic query cache because their failure modes differ:
| Cache | What it stores | Wrong cache hit looks like |
|---|---|---|
| Semantic query cache | Full LLM answer | Wrong content in the response |
| Intent classification cache | Routing decision (intent class + strategy) | Wrong route — retrieval gets bad top-K, prompt uses wrong template |
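The key construction can be sketched as follows. This is a minimal illustration, not the project's code: the ADR does not spell out the normalization rules, so NFKC folding, case folding and whitespace collapsing are assumptions, as is the `|` separator (chosen to resemble the key shown in the verification output). `normalize_query` and `build_intent_cache_key` are hypothetical names.

```python
import re
import unicodedata


def normalize_query(query: str) -> str:
    """Hypothetical normalization: NFKC, case folding, whitespace collapsing."""
    text = unicodedata.normalize("NFKC", query).casefold()
    return re.sub(r"\s+", " ", text).strip()


def build_intent_cache_key(tenant_id: str, query: str, language: str) -> str:
    """Join the three key parts; the '|' separator is an assumption."""
    return f"{tenant_id}|{language.lower()}|{normalize_query(query)}"


# Two phrasings that differ only in case and spacing map to the same key.
key_a = build_intent_cache_key("acme", "  What are your   OPENING hours? ", "NL")
key_b = build_intent_cache_key("acme", "what are your opening hours?", "nl")
# The same question from another tenant must not share an entry.
key_c = build_intent_cache_key("globex", "what are your opening hours?", "nl")
```

Including tenant_id in the key keeps one tenant's routing decisions from leaking into another's, and including language keeps identical strings in different languages apart.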
Two backends share one async interface (IntentCacheBackend Protocol):
Memory backend (default)
Per-worker OrderedDict bounded LRU with TTL. Process restart clears the cache — a safety property that bounds poisoning to one container lifecycle. Used on the single-worker pilot.
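A bounded LRU with TTL on top of OrderedDict might look roughly like the sketch below. It is illustrative only and not the actual MemoryIntentCache: the class name `MemoryLRUCache`, the injectable `clock` parameter, and the synchronous methods (the real interface is async) are simplifications made for this example.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class MemoryLRUCache:
    """Bounded LRU with per-entry expiry (illustrative sketch)."""

    def __init__(self, max_size: int = 1000, ttl_seconds: float = 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._data: "OrderedDict[str, tuple[float, Any]]" = OrderedDict()
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]       # expired: drop the entry and report a miss
            return None
        self._data.move_to_end(key)   # mark as most recently used
        return value

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max_size:
            self._data.popitem(last=False)  # evict the least recently used entry

    def clear(self) -> None:
        self._data.clear()


# Demo with a fake clock: capacity 2, TTL 10 seconds.
now = [0.0]
cache = MemoryLRUCache(max_size=2, ttl_seconds=10, clock=lambda: now[0])
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")       # touching "a" makes "b" the eviction candidate
cache.set("c", 3)    # exceeds capacity, so "b" is evicted
```

Because the dictionary lives inside the worker process, a restart empties it, which is the self-healing property described above.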
Redis backend (opt-in via INTENT_CACHE_BACKEND=redis)
Shared across worker replicas via the existing app.db.redis connection pool. Values are JSON round-tripped through Pydantic. Keys are prefixed intent_cache: so a SCAN-based clear targets only this cache without disturbing rate-limiter or token-blacklist keys.
The Redis backend persists across container restarts — meaning the "restart fixes poisoning" remedy that works for the memory backend no longer applies. The compensating control bundled in the same PR (f9a335c4) is the operator "Clear Cache" button on PlatformSettingsPage, which wipes both the intent cache AND the semantic query cache in a single click (POST /api/v1/settings/cache/clear).
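The prefix-scoped clear can be sketched as below. This is a hedged illustration rather than the real RedisIntentCache: `PrefixedJsonCache` is a hypothetical class, plain `json` stands in for the Pydantic round-trip, and `FakeRedis` is an in-memory stub that mimics only the `get`, `set`, `delete` and `scan_iter` methods of a `redis.asyncio.Redis` client so the example runs without a server.

```python
import asyncio
import fnmatch
import json
from typing import Any, AsyncIterator, Optional

KEY_PREFIX = "intent_cache:"


class FakeRedis:
    """In-memory stand-in for the few async client methods used below."""

    def __init__(self) -> None:
        self.store: dict[str, str] = {}

    async def get(self, name: str) -> Optional[str]:
        return self.store.get(name)

    async def set(self, name: str, value: str, ex: Optional[int] = None) -> None:
        self.store[name] = value  # the stub ignores expiry

    async def delete(self, *names: str) -> int:
        return sum(1 for n in names if self.store.pop(n, None) is not None)

    async def scan_iter(self, match: str) -> AsyncIterator[str]:
        for name in list(self.store):
            if fnmatch.fnmatchcase(name, match):
                yield name


class PrefixedJsonCache:
    """Hypothetical Redis-backed cache: JSON values under one key prefix."""

    def __init__(self, client: Any, ttl_seconds: int = 3600) -> None:
        self._client = client
        self._ttl = ttl_seconds

    async def get(self, key: str) -> Optional[dict]:
        raw = await self._client.get(KEY_PREFIX + key)
        return json.loads(raw) if raw is not None else None

    async def set(self, key: str, value: dict) -> None:
        await self._client.set(KEY_PREFIX + key, json.dumps(value), ex=self._ttl)

    async def clear(self) -> int:
        """Delete only keys under the prefix, found by scanning."""
        deleted = 0
        async for name in self._client.scan_iter(match=KEY_PREFIX + "*"):
            deleted += await self._client.delete(name)
        return deleted


async def demo() -> tuple[int, list[str]]:
    client = FakeRedis()
    client.store["rate_limit:acme"] = "5"  # unrelated key that must survive
    cache = PrefixedJsonCache(client)
    await cache.set("acme|nl|hello", {"intent": "greeting", "confidence": 0.97})
    deleted = await cache.clear()
    return deleted, sorted(client.store)


deleted_count, remaining_keys = asyncio.run(demo())
```

Scanning by prefix instead of flushing the database is what lets the clear leave rate-limiter and token-blacklist keys untouched.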
Poisoning Guard
IntentClassificationService writes to the cache only when:
- result.confidence >= INTENT_CACHE_CONFIDENCE_THRESHOLD (default 0.85)
- result.intent != UserIntent.UNKNOWN
These guards live in the caller, so they apply regardless of backend choice.
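The write-side guard amounts to a single predicate. The sketch below uses stand-ins: `IntentResult` is a simplified dataclass in place of the Pydantic IntentClassificationResult, `UserIntent.FAQ` is a made-up member added for the demo (only UNKNOWN is named in this ADR), and `should_cache` is a hypothetical helper name.

```python
from dataclasses import dataclass
from enum import Enum


class UserIntent(str, Enum):
    FAQ = "faq"          # hypothetical member, for illustration only
    UNKNOWN = "unknown"


@dataclass
class IntentResult:
    """Simplified stand-in for IntentClassificationResult."""
    intent: UserIntent
    confidence: float


def should_cache(result: IntentResult, threshold: float = 0.85) -> bool:
    """Cache only confident classifications with a known intent."""
    return result.confidence >= threshold and result.intent != UserIntent.UNKNOWN


confident = should_cache(IntentResult(UserIntent.FAQ, 0.92))
borderline = should_cache(IntentResult(UserIntent.FAQ, 0.85))
low_confidence = should_cache(IntentResult(UserIntent.FAQ, 0.60))
unknown = should_cache(IntentResult(UserIntent.UNKNOWN, 0.99))
```

A result that fails the guard is still returned to the caller for the current turn; it is simply not stored, so a shaky classification cannot be replayed for the whole TTL.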
Resilience
Every Redis operation is wrapped in try/except and falls back to a cache miss on failure. A Redis outage degrades the system to "every query pays the 2,300 ms LLM cost" — not "the system crashes." If the Redis connection pool is uninitialised when get_intent_cache() is first called with INTENT_CACHE_BACKEND=redis, the singleton falls back to the memory backend with a startup-time warning.
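The fail-open behaviour can be illustrated with a thin wrapper. This is a sketch under assumptions, not the project's code: `ResilientCache` is a hypothetical name, `BrokenClient` is a stub that simulates an outage, and the logger name is invented for the example.

```python
import asyncio
import logging
from typing import Any, Optional

logger = logging.getLogger("intent_cache_demo")  # hypothetical logger name


class BrokenClient:
    """Stub that simulates a Redis outage on every call."""

    async def get(self, name: str) -> Any:
        raise ConnectionError("redis unavailable")

    async def set(self, name: str, value: str, ex: Optional[int] = None) -> None:
        raise ConnectionError("redis unavailable")


class ResilientCache:
    """Wraps a client so that any failure degrades to a cache miss."""

    def __init__(self, client: Any) -> None:
        self._client = client

    async def get(self, key: str) -> Optional[str]:
        try:
            return await self._client.get(key)
        except Exception:
            logger.warning("cache GET failed; treating as a miss")
            return None

    async def set(self, key: str, value: str, ttl_seconds: int = 3600) -> bool:
        try:
            await self._client.set(key, value, ex=ttl_seconds)
            return True
        except Exception:
            logger.warning("cache SET failed; skipping the write")
            return False


async def demo() -> tuple[Optional[str], bool]:
    cache = ResilientCache(BrokenClient())
    hit = await cache.get("acme|nl|hello")
    stored = await cache.set("acme|nl|hello", "{}")
    return hit, stored


hit, stored = asyncio.run(demo())
```

With the outage swallowed at this layer, the caller sees an ordinary miss and proceeds to the LLM call, which matches the degradation described above.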
Configuration
| Env variable | Default | Range | Purpose |
|---|---|---|---|
| INTENT_CACHE_ENABLED | true | bool | Master switch |
| INTENT_CACHE_BACKEND | memory | memory / redis | Pick backend |
| INTENT_CACHE_MAX_SIZE | 1000 | 10–100000 | Memory backend LRU bound |
| INTENT_CACHE_TTL_SECONDS | 3600 | 60–604800 | Both backends |
| INTENT_CACHE_CONFIDENCE_THRESHOLD | 0.85 | 0.0–1.0 | Poisoning guard |
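Loading and range-checking these variables could look like the sketch below. It uses only the standard library for illustration; how the project actually loads settings is not stated in this ADR. `IntentCacheSettings`, `load_settings` and the accepted boolean spellings are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, TypeVar

Number = TypeVar("Number", int, float)


@dataclass(frozen=True)
class IntentCacheSettings:
    enabled: bool
    backend: str
    max_size: int
    ttl_seconds: int
    confidence_threshold: float


def _in_range(name: str, value: Number, low: Number, high: Number) -> Number:
    """Reject values outside the documented range instead of clamping silently."""
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside the allowed range {low}..{high}")
    return value


def load_settings(env: Mapping[str, str] = os.environ) -> IntentCacheSettings:
    backend = env.get("INTENT_CACHE_BACKEND", "memory").strip().lower()
    if backend not in ("memory", "redis"):
        raise ValueError(f"INTENT_CACHE_BACKEND={backend!r} must be memory or redis")
    enabled_raw = env.get("INTENT_CACHE_ENABLED", "true").strip().lower()
    return IntentCacheSettings(
        enabled=enabled_raw in ("1", "true", "yes"),  # accepted spellings are assumed
        backend=backend,
        max_size=_in_range(
            "INTENT_CACHE_MAX_SIZE", int(env.get("INTENT_CACHE_MAX_SIZE", "1000")),
            10, 100_000),
        ttl_seconds=_in_range(
            "INTENT_CACHE_TTL_SECONDS", int(env.get("INTENT_CACHE_TTL_SECONDS", "3600")),
            60, 604_800),
        confidence_threshold=_in_range(
            "INTENT_CACHE_CONFIDENCE_THRESHOLD",
            float(env.get("INTENT_CACHE_CONFIDENCE_THRESHOLD", "0.85")),
            0.0, 1.0),
    )


defaults = load_settings({})
redis_mode = load_settings({"INTENT_CACHE_BACKEND": "redis",
                            "INTENT_CACHE_TTL_SECONDS": "600"})
```

Failing fast on an out-of-range value at startup is one possible design; clamping to the nearest bound with a warning would be an equally defensible alternative.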
Consequences
Positive
- Cache hit removes ~2,300 ms from the per-turn latency budget. Stacks with the semantic_query_cache — if both hit, the full pipeline collapses to ~50 ms.
- Backend selection is a runtime knob, not a code change. Single-worker pilot uses memory; multi-worker production flips to Redis without rebuilding the image.
- Poisoning has a one-click remedy via the existing UI button — operators don't need shell access to recover from a cache-poisoning incident.
Negative
- Two caches now share the same poisoning failure mode. In Redis mode, container restart no longer self-heals poisoning. This is the entire reason the UI kill switch was bundled in the same PR as the Redis backend, rather than deferred to a follow-up.
- Marginal Redis pool pressure under heavy traffic: one additional GET per request that isn't already cached at the semantic layer. Pool defaults are sized for rate-limiting + token-blacklist + ingestion; redis_max_connections may need adjustment under future load profiles.
- Cross-worker observability is per-process today. stats() returns per-worker counters; aggregate hit rate requires aggregation across workers. Future work would emit metrics to the existing pipeline_telemetry stream.
Alternatives Considered
| Alternative | Rejected because |
|---|---|
| Extend the semantic_query_cache to also cache intent results | Conflates two different failure modes and value types. The semantic cache key is a 1536-dim embedding of the reformulated query; the intent cache key is the raw user input. The semantic cache stores full answers; the intent cache stores classification objects. Reusing the table would have forced shared schema, eviction policy, and TTL semantics. |
| Cache intent results in PostgreSQL | Adds a database round-trip on every classification — opposite of the goal. Redis hits in 1–2 ms locally; PG hits in 5–10 ms. |
| Embed the cache inside IntentClassificationService | Would couple the cache to the service and prevent the clean Protocol-based backend swap. The current factory + Protocol design keeps the service agnostic to backend choice. |
Verification
Live verification on pilot (zol-rag-app:f9a335c4, 2026-05-12):
    # 1. Trigger a fresh intent classification
    curl -X POST .../api/v1/query -d '{"query":"...","channel":"web"}'
    # 2. Confirm Redis key written
    redis-cli --scan --pattern "intent_cache:*"
    # Returns: intent_cache:|nl|<normalized query>
Test coverage:
- 12 unit tests on MemoryIntentCache (backend/tests/unit/services/test_intent_cache.py)
- 13 integration tests on RedisIntentCache against a Redis 7 testcontainer (backend/tests/integration/test_redis_intent_cache.py)
- 4 integration tests on the kill-switch endpoint (backend/tests/integration/api/test_settings_cache_clear.py)
Related
- ADR-0031: Semantic Query Cache (the other cache layer; both share the same UI kill switch)
- backend/app/services/intent_cache.py (implementation)
- backend/app/api/settings.py (POST /api/v1/settings/cache/clear endpoint)
- System Overview (where this cache fits in the layered architecture)