ADR-0023: Prompt Caching Optimization
Date: 2026-02-10 | Status: Deferred
Context
LLM API calls in the RAG pipeline (Lewis et al., 2020) share large repeated prompt prefixes:
- System prompts: Safety rules, response formatting, medical disclaimer instructions (~500-1500 tokens)
- Taxonomy context: Department lists, entity type rules injected into validation prompts (~800 tokens)
- Document context: Retrieved chunks assembled into context blocks (~2000-4000 tokens per query)
Prompt caching allows LLM providers to cache these repeated prefixes across API calls, reducing both cost and latency. OpenAI provides automatic prompt caching (75% discount on cached input tokens for prompts >1024 tokens). Anthropic offers explicit cache control headers.
Decision
Defer explicit optimization. Rationale:
-
Automatic caching already active: OpenAI automatically caches prompts >1024 tokens. Our taxonomy-enriched validation prompts qualify and already receive cached input pricing (tracked via
cached_tokensinCostTracker). -
Current costs acceptable: With the tiered model migration (ADR-0015), ingestion costs dropped to
$10-11 per full run. Query-time costs are minimal ($0.001-0.003 per query). -
Provider dependency: Explicit cache control differs across providers (OpenAI automatic vs Anthropic explicit headers vs local models). Optimizing for one provider creates lock-in.
-
PoC/demo scope: Current usage patterns (dev testing, demo sessions) do not generate enough volume for caching to meaningfully impact costs.
Consequences
- Revisit when moving to production scale (thousands of queries/day)
- Monitor
cached_tokensmetrics inCostTrackerto understand current cache hit rates - Consider explicit Anthropic-style cache breakpoints if migrating to Claude for any pipeline stages
Alternatives Considered
Alternative 1: Explicit Cache Breakpoints Now
Add provider-specific cache control headers to all system prompts.
- Pros: Maximum cost savings immediately
- Cons: Provider-specific code paths, minimal savings at current volume
- Why rejected: Premature optimization at PoC scale
Alternative 2: Semantic Prompt Deduplication
Hash and deduplicate prompt prefixes application-side before sending to LLM.
- Pros: Provider-agnostic
- Cons: Complex implementation, most providers already handle this server-side
- Why rejected: Duplicates provider functionality without clear benefit
Related ADRs
- ADR-0015: Taxonomy-Driven Normalization and LLM Optimization (cost tracking, model routing)