Architectural Update (March 2026)

This ADR was written when the system used Neo4j for entity storage. As of March 2026, Neo4j has been fully removed and replaced by PostgreSQL taxonomy tables (taxonomy_entities, taxonomy_relationships). The decision rationale documented here remains valid; the storage layer has changed.

ADR-0019: Contextual Embeddings for Retrieval Quality

Date: 2026-02-10 | Status: Accepted

Conceptual companion

This ADR is the decision record. For the conceptual explanation of canonical questions and page summaries — how they work, why (HyDE-at-index-time + Anthropic contextual retrieval), and how they feed the query-time retrieval-steering triad — see Ingestion Enrichment.

Context

Raw text chunks lose context when separated from their parent document. A chunk about "visiting hours" conveys no information about which department or campus it belongs to. Anthropic's research on contextual retrieval reports a 35% to 67% reduction in retrieval failure rate when chunks are enriched with document-level context before embedding: 35% with contextual embeddings alone, 49% when combined with contextual BM25, and 67% when reranking is added.

In the ZOL corpus, many chunks are extracted from lengthy pages covering multiple topics (e.g., a department page listing doctors, conditions, treatments, and practical information). Without situating context, the embedding captures only the local text, missing critical parent-document signals.

Decision

Implement Anthropic-style contextual retrieval for all document chunks during ingestion.

Chunk Context Generation

For each chunk, generate a 50-100 token context using the Tier 2 (standard) model that situates the chunk within its parent document. The context captures:

Which document/page the chunk comes from
The main topic of the parent document
How this chunk relates to the overall document
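
As an illustration of the three points above, a prompt for the situating context could be assembled roughly as follows. This is a minimal sketch, not the prompt used in processing_service.py: the `ChunkContextRequest` class, the `build_context_prompt` name, the prompt wording, and the `max_doc_chars` limit are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkContextRequest:
    """Inputs needed to situate one chunk (hypothetical structure)."""
    document_title: str
    document_text: str
    chunk_text: str


def build_context_prompt(req: ChunkContextRequest, max_doc_chars: int = 8000) -> str:
    """Build a prompt asking the model for a short situating context.

    The wording and the max_doc_chars limit are illustrative assumptions.
    """
    # Keep the parent document within an assumed prompt budget.
    document_excerpt = req.document_text[:max_doc_chars]
    return (
        f"<document title='{req.document_title}'>\n"
        f"{document_excerpt}\n"
        "</document>\n\n"
        f"<chunk>\n{req.chunk_text}\n</chunk>\n\n"
        "In 50-100 tokens, state which document or page this chunk comes from, "
        "the main topic of that document, and how the chunk relates to it. "
        "Answer with the context only."
    )
```

The returned string would then be sent to the Tier 2 model; the model call itself is omitted here because the client interface is not described in this ADR.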

Enriched Text Format

Prepend context and canonical questions to chunk text before embedding AND BM25 indexing:

{chunk_context}
{canonical_questions}
{original_text}

The raw text is stored unchanged in the content column. The enriched text is used only for embedding generation and BM25 indexing. A maximum character cap (3,000 chars) prevents the enriched text from exceeding the embedding model's effective window.
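
The assembly described above might look like the sketch below. It assumes the cap is applied by truncating the tail of the combined string, so the context and questions survive and only the end of a long chunk is cut; the actual _build_enriched_text() may apply the cap differently. The function name and signature here are illustrative.

```python
from typing import Sequence


def build_enriched_text(
    chunk_context: str,
    canonical_questions: Sequence[str],
    original_text: str,
    max_chars: int = 3000,
) -> str:
    """Combine context, canonical questions and raw text, in that order.

    Assumption: the character cap truncates the end of the combined text.
    """
    questions = "\n".join(q.strip() for q in canonical_questions if q.strip())
    parts = [chunk_context.strip(), questions, original_text.strip()]
    # Skip empty parts so a chunk without context or questions has no blank lines.
    enriched = "\n".join(part for part in parts if part)
    return enriched[:max_chars]


# The enriched string feeds the embedding and BM25 index only;
# the unmodified original_text is what goes into the content column.
enriched = build_enriched_text(
    "From the cardiology department page; practical information section.",
    ["What are the visiting hours for cardiology?"],
    "Visiting hours are 14:00-20:00.",
)
```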

Cost Estimate

Corpus size: ~18,600 chunks
Model: Tier 2 (standard)
Estimated cost: ~$2.50 for full corpus re-embedding
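
For planning incremental ingestion, the estimate above can be reduced to a per-chunk figure. The sketch below uses only the two numbers stated in this section and assumes cost scales linearly with chunk count, which ignores differences in chunk and document length; the function names are illustrative.

```python
CORPUS_CHUNKS = 18_600       # corpus size from the estimate above
ESTIMATED_TOTAL_USD = 2.50   # full-corpus cost from the estimate above


def cost_per_chunk(
    total_usd: float = ESTIMATED_TOTAL_USD, chunks: int = CORPUS_CHUNKS
) -> float:
    """Average cost of enriching and embedding one chunk."""
    if chunks <= 0:
        raise ValueError("chunks must be positive")
    return total_usd / chunks


def projected_cost(new_chunks: int) -> float:
    """Projected cost for new_chunks, assuming linear scaling."""
    return new_chunks * cost_per_chunk()
```

Under that assumption, the average works out to roughly $0.13 per 1,000 chunks.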

Implementation

In processing_service.py:

_generate_chunk_contexts() — batched LLM calls to generate situating context per chunk
_build_enriched_text() — combines context + canonical questions + original text (with max_chars cap)
_generate_canonical_questions_batch() — generates 1-2 questions each chunk could answer
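
The batching behind _generate_chunk_contexts() could be structured along these lines. This is a sketch under assumptions: the `generate_batch` callable stands in for the real LLM client, the batch size of 20 is arbitrary, and falling back to empty contexts on a misaligned response is one possible policy rather than the documented one.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def generate_chunk_contexts(
    chunks: Sequence[str],
    generate_batch: Callable[[Sequence[str]], List[str]],
    batch_size: int = 20,
) -> List[str]:
    """Return one context per chunk, in the same order as the input.

    generate_batch is a hypothetical stand-in for the batched LLM call.
    """
    contexts: List[str] = []
    for batch in batched(chunks, batch_size):
        batch_contexts = generate_batch(batch)
        if len(batch_contexts) != len(batch):
            # Assumed policy: keep chunk alignment by using empty contexts
            # when the model returns the wrong number of results.
            batch_contexts = [""] * len(batch)
        contexts.extend(batch_contexts)
    return contexts
```

Injecting the LLM call as a parameter keeps the batching logic testable without network access.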

Consequences

Positive

Significant retrieval improvement: 35% to 67% reduction in retrieval failure rate (per Anthropic research)
Better BM25 matching: Context terms (department names, page titles) appear in indexed text
Better vector similarity: Embeddings capture document-level semantics alongside chunk content
No runtime latency impact: Context is baked into embeddings at ingestion time
Low cost: ~$2.50 for the entire corpus using the Tier 2 (standard) model

Negative

Slower ingestion: Additional batched LLM calls to generate per-chunk context add ~15-20 minutes to a full ingestion
Re-embedding required: Existing chunks need re-embedding after enabling contextual retrieval
Storage increase: Enriched text is larger than raw text (it feeds the embedding and the BM25 index; the content column still holds only the raw text)

Neutral

Query pipeline unchanged (searches against same pgvector/BM25 indexes)
PostgreSQL taxonomy unchanged
Raw chunk content preserved as-is

Alternatives Considered

Alternative 1: Document Title Prepend Only

Prepend only the document title to each chunk (no LLM call).

Pros: Zero LLM cost, simple implementation
Cons: Misses nuanced context, no canonical questions
Why rejected: Anthropic research shows LLM-generated context significantly outperforms simple title prepend

Alternative 2: Hierarchical Embeddings (Parent + Child)

Store embeddings at both chunk and document level, retrieve by parent then refine by child.

Pros: Captures both granular and broad context
Cons: Complex retrieval logic, doubles storage, harder to tune ranking
Why rejected: Contextual embeddings achieve similar benefits with simpler architecture
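
To make the "complex retrieval logic" concern concrete, a two-stage parent-then-child lookup might look like the sketch below. It was never implemented in this system; the function names, the in-memory dictionaries, and the `top_docs`/`top_chunks` cutoffs are assumptions for illustration only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hierarchical_search(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, List[float]],
    chunk_vecs: Dict[str, Dict[str, List[float]]],
    top_docs: int = 2,
    top_chunks: int = 3,
) -> List[Tuple[str, str]]:
    """Stage 1: rank parent documents. Stage 2: rank chunks within them."""
    ranked_docs = sorted(
        doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True
    )[:top_docs]
    candidates = [
        (cosine(query_vec, vec), doc_id, chunk_id)
        for doc_id in ranked_docs
        for chunk_id, vec in chunk_vecs.get(doc_id, {}).items()
    ]
    candidates.sort(reverse=True)
    return [(doc_id, chunk_id) for _, doc_id, chunk_id in candidates[:top_chunks]]
```

Two embedding sets and two cutoffs must be maintained and tuned together, and a relevant chunk is unreachable whenever its parent document misses the first-stage cutoff.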

References

Context Filtering and Enrichment

The contextual embedding approach implemented here addresses the broader problem of context filtering in retrieval-augmented generation — determining which retrieved information is actually useful for the generation model. Wang et al. (2023) formalise this problem in FILCO (Learning to Filter Context for Retrieval-Augmented Generation), demonstrating that context filtering using lexical overlap and conditional cross-mutual information reduces prompt lengths by up to 64% while improving answer quality across extractive QA, multi-hop reasoning, and fact verification tasks.

The ZOL system addresses the same underlying problem from the ingestion side rather than the query side: instead of filtering retrieved context at query time (FILCO), we enrich chunk context at ingestion time (contextual embeddings), ensuring that the embedding itself captures the document-level signals needed for precise retrieval. These approaches are complementary — FILCO-style query-time filtering could further refine the context assembled from contextually-enriched chunks.

Günther et al. (2024) propose Late Chunking as an alternative approach: embedding entire documents using long-context models before splitting into chunks, thereby preserving cross-sentence context in the embedding space. While theoretically appealing, Late Chunking requires models with very long context windows and introduces architectural complexity. Our contextual embedding approach achieves a similar effect with simpler infrastructure.
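
For comparison, the pooling step at the heart of Late Chunking can be sketched as follows. The sketch assumes token embeddings for the whole document have already been produced by a long-context embedding model, which is the part that requires the heavier infrastructure; that model call is not shown, and the function name and span representation are illustrative.

```python
from typing import List, Sequence, Tuple


def late_chunk_embeddings(
    token_embeddings: Sequence[Sequence[float]],
    chunk_spans: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Mean-pool document-level token embeddings over each [start, end) span.

    Because every token was encoded with the full document in view,
    each pooled chunk vector carries context from outside its own span.
    """
    pooled: List[List[float]] = []
    for start, end in chunk_spans:
        span = token_embeddings[start:end]
        if not span:
            raise ValueError(f"empty chunk span: ({start}, {end})")
        dim = len(span[0])
        pooled.append([sum(tok[i] for tok in span) / len(span) for i in range(dim)])
    return pooled
```

The contextual approach in this ADR instead embeds each chunk independently and supplies the document-level signal as prepended text.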

Anthropic. (2024). Introducing Contextual Retrieval. Anthropic Research Blog. — 49% retrieval failure reduction with contextual embeddings + hybrid search.
Wang, Z., Araki, J., Jiang, Z., Parvez, M. R., & Neubig, G. (2023). Learning to Filter Context for Retrieval-Augmented Generation. arXiv preprint, arXiv:2311.08377. — FILCO: context filtering that reduces prompt length by up to 64%.
Günther, M., Mohr, I., Williams, D. J., Wang, B., & Xiao, H. (2024). Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models. arXiv preprint, arXiv:2409.04701. — Alternative approach using long-context models for context-preserving embeddings.
Vake, L., Stanny, O., & Guthrie, R. (2025). HyPE-RAG: Hypothetical Prompt Embeddings for Retrieval-Augmented Generation. SSRN Electronic Journal. — Hypothetical Prompt Embeddings for query-aligned chunk retrieval.

Related ADRs

ADR-0048: OpenAI Embeddings Migration (current embedding model — text-embedding-3-large, 1536 dim, hosted)
ADR-0033: BGE-M3 Embedding Migration (superseded by ADR-0048)
ADR-0005: Original nomic-embed-text selection (superseded by ADR-0033)
ADR-0014: LLM Entity Validation and Contextual Retrieval (initial contextual retrieval concept)
ADR-0017: Context Retrieval Architecture (retrieval pipeline design)
