ADR-0023: Prompt Caching Optimization

Date: 2026-02-10 | Status: Deferred

Context

LLM API calls in the RAG pipeline (Lewis et al., 2020) share large repeated prompt prefixes:

System prompts: Safety rules, response formatting, medical disclaimer instructions (~500-1500 tokens)
Taxonomy context: Department lists, entity type rules injected into validation prompts (~800 tokens)
Document context: Retrieved chunks assembled into context blocks (~2000-4000 tokens per query)

Prompt caching allows LLM providers to cache these repeated prefixes across API calls, reducing both cost and latency. OpenAI provides automatic prompt caching (75% discount on cached input tokens for prompts >1024 tokens). Anthropic offers explicit cache control headers.

Decision

Defer explicit optimization. Rationale:

Automatic caching already active: OpenAI automatically caches prompts >1024 tokens. Our taxonomy-enriched validation prompts qualify and already receive cached input pricing (tracked via cached_tokens in CostTracker).
Current costs acceptable: With the tiered model migration (ADR-0015), ingestion costs dropped to ~~$10-11 per full run. Query-time costs are minimal (~~$0.001-0.003 per query).
Provider dependency: Explicit cache control differs across providers (OpenAI automatic vs Anthropic explicit headers vs local models). Optimizing for one provider creates lock-in.
PoC/demo scope: Current usage patterns (dev testing, demo sessions) do not generate enough volume for caching to meaningfully impact costs.

Consequences

Revisit when moving to production scale (thousands of queries/day)
Monitor cached_tokens metrics in CostTracker to understand current cache hit rates
Consider explicit Anthropic-style cache breakpoints if migrating to Claude for any pipeline stages

Alternatives Considered

Alternative 1: Explicit Cache Breakpoints Now

Add provider-specific cache control headers to all system prompts.

Pros: Maximum cost savings immediately
Cons: Provider-specific code paths, minimal savings at current volume
Why rejected: Premature optimization at PoC scale

Alternative 2: Semantic Prompt Deduplication

Hash and deduplicate prompt prefixes application-side before sending to LLM.

Pros: Provider-agnostic
Cons: Complex implementation, most providers already handle this server-side
Why rejected: Duplicates provider functionality without clear benefit

ADR-0015: Taxonomy-Driven Normalization and LLM Optimization (cost tracking, model routing)

Context​

Decision​

Consequences​

Alternatives Considered​

Alternative 1: Explicit Cache Breakpoints Now​

Alternative 2: Semantic Prompt Deduplication​

Related ADRs​