Skip to main content

ADR-0023: Prompt Caching Optimization

Date: 2026-02-10 | Status: Deferred

Context

LLM API calls in the RAG pipeline (Lewis et al., 2020) share large repeated prompt prefixes:

  • System prompts: Safety rules, response formatting, medical disclaimer instructions (~500-1500 tokens)
  • Taxonomy context: Department lists, entity type rules injected into validation prompts (~800 tokens)
  • Document context: Retrieved chunks assembled into context blocks (~2000-4000 tokens per query)

Prompt caching allows LLM providers to cache these repeated prefixes across API calls, reducing both cost and latency. OpenAI provides automatic prompt caching (75% discount on cached input tokens for prompts >1024 tokens). Anthropic offers explicit cache control headers.

Decision

Defer explicit optimization. Rationale:

  1. Automatic caching already active: OpenAI automatically caches prompts >1024 tokens. Our taxonomy-enriched validation prompts qualify and already receive cached input pricing (tracked via cached_tokens in CostTracker).

  2. Current costs acceptable: With the tiered model migration (ADR-0015), ingestion costs dropped to $10-11 per full run. Query-time costs are minimal ($0.001-0.003 per query).

  3. Provider dependency: Explicit cache control differs across providers (OpenAI automatic vs Anthropic explicit headers vs local models). Optimizing for one provider creates lock-in.

  4. PoC/demo scope: Current usage patterns (dev testing, demo sessions) do not generate enough volume for caching to meaningfully impact costs.

Consequences

  • Revisit when moving to production scale (thousands of queries/day)
  • Monitor cached_tokens metrics in CostTracker to understand current cache hit rates
  • Consider explicit Anthropic-style cache breakpoints if migrating to Claude for any pipeline stages

Alternatives Considered

Alternative 1: Explicit Cache Breakpoints Now

Add provider-specific cache control headers to all system prompts.

  • Pros: Maximum cost savings immediately
  • Cons: Provider-specific code paths, minimal savings at current volume
  • Why rejected: Premature optimization at PoC scale

Alternative 2: Semantic Prompt Deduplication

Hash and deduplicate prompt prefixes application-side before sending to LLM.

  • Pros: Provider-agnostic
  • Cons: Complex implementation, most providers already handle this server-side
  • Why rejected: Duplicates provider functionality without clear benefit
  • ADR-0015: Taxonomy-Driven Normalization and LLM Optimization (cost tracking, model routing)