Architectural Update (March 2026)

This ADR was written when the system used Neo4j for entity storage. As of March 2026, Neo4j has been fully removed and replaced by PostgreSQL taxonomy tables (taxonomy_entities, taxonomy_relationships). The decision rationale documented here remains valid; the storage layer has changed.

ADR-0019: Contextual Embeddings for Retrieval Quality

Date: 2026-02-10 | Status: Accepted

Conceptual companion

This ADR is the decision record. For the conceptual explanation of canonical questions and page summaries — how they work, why (HyDE-at-index-time + Anthropic contextual retrieval), and how they feed the query-time retrieval-steering triad — see Ingestion Enrichment.

Context

Raw text chunks lose context when separated from their parent document. A chunk about "visiting hours" conveys no information about which department or campus it belongs to. Anthropic's research on contextual retrieval reports a 35% to 67% reduction in retrieval failure rate when chunks are enriched with document-level context before embedding.

In the ZOL corpus, many chunks are extracted from lengthy pages covering multiple topics (e.g., a department page listing doctors, conditions, treatments, and practical information). Without situating context, the embedding captures only the local text, missing critical parent-document signals.

Decision

Implement Anthropic-style contextual retrieval for all document chunks during ingestion.

Chunk Context Generation

For each chunk, generate a 50-100 token context using the Tier 2 (standard) model that situates the chunk within its parent document. The context captures:

  • Which document/page the chunk comes from
  • The main topic of the parent document
  • How this chunk relates to the overall document
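As an illustration of the three points above, a situating prompt could be assembled as follows. This is a minimal sketch, not the prompt used in processing_service.py: the function name build_context_prompt, its parameters, and the prompt wording are all hypothetical.

```python
def build_context_prompt(document_title: str, document_text: str, chunk_text: str) -> str:
    """Build a prompt asking an LLM to situate one chunk within its parent document.

    Hypothetical helper: the real prompt wording is not given in this ADR.
    """
    return (
        "<document>\n"
        f"Title: {document_title}\n"
        f"{document_text}\n"
        "</document>\n\n"
        "<chunk>\n"
        f"{chunk_text}\n"
        "</chunk>\n\n"
        # The three requested facts mirror the list in this section.
        "In 50-100 tokens, state which document or page this chunk comes from, "
        "the main topic of that document, and how the chunk relates to it. "
        "Answer with the context only."
    )
```

The returned string would be sent to the Tier 2 model; the model's reply is the chunk context.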

Enriched Text Format

Prepend context and canonical questions to chunk text before embedding AND BM25 indexing:

{chunk_context}
{canonical_questions}
{original_text}

The raw text is stored unchanged in the content column. The enriched text is used only for embedding generation and BM25 indexing. A maximum character cap (3,000 chars) prevents the enriched text from exceeding the embedding model's effective window.
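The enriched format and the character cap can be sketched as below. This is an assumption-laden illustration rather than the actual _build_enriched_text(): the signature is hypothetical, and the cap is assumed to truncate the combined string from the end, so the tail of the original text is trimmed first.

```python
from __future__ import annotations


def build_enriched_text(
    chunk_context: str,
    canonical_questions: list[str],
    original_text: str,
    max_chars: int = 3000,
) -> str:
    """Combine context, canonical questions, and raw text, then apply the cap."""
    questions = "\n".join(q.strip() for q in canonical_questions if q.strip())
    parts = [chunk_context.strip(), questions, original_text.strip()]
    # Skip empty parts so a missing context or question list leaves no blank lines.
    enriched = "\n".join(part for part in parts if part)
    # Assumption: simple tail truncation; the real cap strategy may differ.
    return enriched[:max_chars]
```

Only the return value would go to the embedding model and the BM25 index; the content column still receives original_text unchanged.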

Cost Estimate

  • Corpus size: ~18,600 chunks
  • Model: Tier 2 (standard)
  • Estimated cost: ~$2.50 for full corpus re-embedding

Implementation

In processing_service.py:

  • _generate_chunk_contexts() — batched LLM calls to generate situating context per chunk
  • _build_enriched_text() — combines context + canonical questions + original text (with max_chars cap)
  • _generate_canonical_questions_batch() — generates 1-2 questions each chunk could answer
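The batching pattern behind these functions can be sketched as follows. The names batched and generate_chunk_contexts, the batch_size default, and the injected generate_batch callable are hypothetical; the real functions call the LLM client directly, and their batch size is not stated in this ADR.

```python
from __future__ import annotations

from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def generate_chunk_contexts(
    chunks: Sequence[str],
    generate_batch: Callable[[Sequence[str]], list[str]],
    batch_size: int = 20,  # assumption: illustrative default only
) -> list[str]:
    """Return one context per chunk, in input order, using batched calls."""
    contexts: list[str] = []
    for batch in batched(chunks, batch_size):
        results = generate_batch(batch)
        # Guard against a model reply that drops or adds entries.
        if len(results) != len(batch):
            raise ValueError("expected one context per chunk in the batch")
        contexts.extend(results)
    return contexts
```

Injecting generate_batch keeps the sketch testable without a model: a stub that returns one string per chunk is enough to exercise the ordering and the length check.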

Consequences

Positive

  • Significant retrieval improvement: 35% to 67% lower retrieval failure rate (per Anthropic research)
  • Better BM25 matching: Context terms (department names, page titles) appear in indexed text
  • Better vector similarity: Embeddings capture document-level semantics alongside chunk content
  • No runtime latency impact: Context is baked into embeddings at ingestion time
  • Low cost: ~$2.50 for the entire corpus using the Tier 2 model (see Cost Estimate)

Negative

  • Slower ingestion: Additional LLM call per chunk adds ~15-20 minutes to full ingestion
  • Re-embedding required: Existing chunks need re-embedding after enabling contextual retrieval
  • Storage increase: Enriched text is larger than raw text; the extra text feeds the embedding input and BM25 index, not the content column

Neutral

  • Query pipeline unchanged (searches against same pgvector/BM25 indexes)
  • PostgreSQL taxonomy unchanged
  • Raw chunk content preserved as-is

Alternatives Considered

Alternative 1: Document Title Prepend Only

Prepend only the document title to each chunk (no LLM call).

  • Pros: Zero LLM cost, simple implementation
  • Cons: Misses nuanced context, no canonical questions
  • Why rejected: Anthropic research shows LLM-generated context significantly outperforms simple title prepend

Alternative 2: Hierarchical Embeddings (Parent + Child)

Store embeddings at both chunk and document level, retrieve by parent then refine by child.

  • Pros: Captures both granular and broad context
  • Cons: Complex retrieval logic, doubles storage, harder to tune ranking
  • Why rejected: Contextual embeddings achieve similar benefits with simpler architecture

References

Context Filtering and Enrichment

The contextual embedding approach implemented here addresses the broader problem of context filtering in retrieval-augmented generation — determining which retrieved information is actually useful for the generation model. Wang et al. (2023) formalise this problem in FILCO (Learning to Filter Context for Retrieval-Augmented Generation), demonstrating that context filtering using lexical overlap and conditional cross-mutual information reduces prompt lengths by up to 64% while improving answer quality across extractive QA, multi-hop reasoning, and fact verification tasks.

The ZOL system addresses the same underlying problem from the ingestion side rather than the query side: instead of filtering retrieved context at query time (FILCO), we enrich chunk context at ingestion time (contextual embeddings), ensuring that the embedding itself captures the document-level signals needed for precise retrieval. These approaches are complementary — FILCO-style query-time filtering could further refine the context assembled from contextually-enriched chunks.

Günther et al. (2024) propose Late Chunking as an alternative approach: embedding entire documents using long-context models before splitting into chunks, thereby preserving cross-sentence context in the embedding space. While theoretically appealing, Late Chunking requires models with very long context windows and introduces architectural complexity. Our contextual embedding approach achieves a similar effect with simpler infrastructure.

  • ADR-0048: OpenAI Embeddings Migration (current embedding model: text-embedding-3-large, 1536 dim, hosted)
  • ADR-0033: BGE-M3 Embedding Migration (superseded by ADR-0048)
  • ADR-0005: Original nomic-embed-text selection (superseded by ADR-0033)
  • ADR-0014: LLM Entity Validation and Contextual Retrieval (initial contextual retrieval concept)
  • ADR-0017: Context Retrieval Architecture (retrieval pipeline design)