This ADR was written when the system used Neo4j for entity storage. As of March 2026, Neo4j has been fully removed and replaced by PostgreSQL taxonomy tables (taxonomy_entities, taxonomy_relationships). The decision rationale documented here remains valid; the storage layer has changed.
ADR-0019: Contextual Embeddings for Retrieval Quality
Date: 2026-02-10 | Status: Accepted
This ADR is the decision record. For the conceptual explanation of canonical questions and page summaries — how they work, why (HyDE-at-index-time + Anthropic contextual retrieval), and how they feed the query-time retrieval-steering triad — see Ingestion Enrichment.
Context
Raw text chunks lose context when separated from their parent document. A chunk about "visiting hours" conveys no information about which department or campus it belongs to. Anthropic's research on contextual retrieval reports a 35% to 67% reduction in retrieval failure rate when chunks are enriched with document-level context before embedding: 35% with contextual embeddings alone, rising to 67% when combined with contextual BM25 and reranking.
In the ZOL corpus, many chunks are extracted from lengthy pages covering multiple topics (e.g., a department page listing doctors, conditions, treatments, and practical information). Without situating context, the embedding captures only the local text, missing critical parent-document signals.
Decision
Implement Anthropic-style contextual retrieval for all document chunks during ingestion.
Chunk Context Generation
For each chunk, generate a 50-100 token context using the Tier 2 (standard) model that situates the chunk within its parent document. The context captures:
- Which document/page the chunk comes from
- The main topic of the parent document
- How this chunk relates to the overall document
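The prompt for this step might look like the following minimal sketch. It is modeled on the prompt Anthropic published for contextual retrieval and is not the system's actual prompt; the function name `build_context_prompt` and its parameters are assumptions for illustration.

```python
def build_context_prompt(document_title: str, document_text: str, chunk_text: str) -> str:
    """Build a prompt asking the model to situate one chunk in its parent document.

    Hypothetical helper: the wording is illustrative, not the production prompt.
    """
    return (
        f"Document title: {document_title}\n"
        f"<document>\n{document_text}\n</document>\n\n"
        "Here is the chunk we want to situate within the whole document:\n"
        f"<chunk>\n{chunk_text}\n</chunk>\n\n"
        "In 50-100 tokens, state which page this chunk comes from, the main "
        "topic of the page, and how the chunk relates to it. "
        "Answer only with the context and nothing else."
    )


prompt = build_context_prompt(
    document_title="Cardiology department",
    document_text="Doctors, conditions, treatments, practical information ...",
    chunk_text="Visiting hours are 14:00-20:00.",
)
```

Passing the full parent document alongside the chunk is what lets the model name the department or campus that the chunk alone does not mention.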
Enriched Text Format
Prepend context and canonical questions to chunk text before embedding AND BM25 indexing:
{chunk_context}
{canonical_questions}
{original_text}
The raw text is stored unchanged in the content column. The enriched text is used only for embedding generation and BM25 indexing. A maximum character cap (3,000 chars) prevents the enriched text from exceeding the embedding model's effective window.
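As a rough illustration of the format and cap described above, the assembly could be sketched as follows. The function below is hypothetical and is not the `_build_enriched_text()` in `processing_service.py`; in particular, truncating the tail of the combined string is an assumption, since the source states only that a 3,000-character cap exists.

```python
def build_enriched_text(
    chunk_context: str,
    canonical_questions: list[str],
    original_text: str,
    max_chars: int = 3000,
) -> str:
    """Combine context, canonical questions, and raw text into one string.

    Assumption: when the result exceeds max_chars, the tail is cut off, so the
    context and questions survive and the end of the raw text is dropped.
    """
    questions = "\n".join(q.strip() for q in canonical_questions if q.strip())
    parts = [chunk_context.strip(), questions, original_text.strip()]
    # Skip empty parts so a missing context or question list leaves no blank lines.
    enriched = "\n".join(part for part in parts if part)
    return enriched[:max_chars]


enriched = build_enriched_text(
    chunk_context="From the Cardiology department page of the hospital website.",
    canonical_questions=["When are visiting hours in cardiology?"],
    original_text="Visiting hours are 14:00-20:00.",
)
```

Only the returned string would be sent to the embedding model and the BM25 indexer; the `original_text` argument is what stays in the content column.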
Cost Estimate
- Corpus size: ~18,600 chunks
- Model: Tier 2 (standard)
- Estimated cost: ~$2.50 for full corpus re-embedding
Implementation
In processing_service.py:
- _generate_chunk_contexts(): batched LLM calls to generate situating context per chunk
- _build_enriched_text(): combines context + canonical questions + original text (with max_chars cap)
- _generate_canonical_questions_batch(): generates 1-2 questions each chunk could answer
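The batching pattern behind the context-generation step can be sketched as below. This is a simplified illustration, not the code in `processing_service.py`: the `complete_batch` callable stands in for whatever LLM client the service uses, and the default batch size of 20 is an arbitrary assumption.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def generate_chunk_contexts(
    chunks: Sequence[str],
    complete_batch: Callable[[Sequence[str]], list[str]],
    batch_size: int = 20,
) -> list[str]:
    """Return one situating context per chunk, in the same order as the input.

    complete_batch is a hypothetical callable that takes a batch of chunk
    texts and returns one context string per chunk.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    contexts: list[str] = []
    for batch in batched(chunks, batch_size):
        results = complete_batch(batch)
        # Guard against a model response that drops or adds entries.
        if len(results) != len(batch):
            raise ValueError("LLM returned a different number of contexts than chunks")
        contexts.extend(results)
    return contexts
```

Injecting the LLM call as a parameter keeps the batching logic testable with a stub, which helps when a full ingestion run takes several minutes.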
Consequences
Positive
- Significant retrieval improvement: 35% to 67% reduction in retrieval failure rate (per Anthropic research)
- Better BM25 matching: Context terms (department names, page titles) appear in indexed text
- Better vector similarity: Embeddings capture document-level semantics alongside chunk content
- No runtime latency impact: Context is baked into embeddings at ingestion time
- Low cost: ~$2.50 for the entire corpus using the Tier 2 (standard) model
Negative
- Slower ingestion: Additional LLM call per chunk adds ~15-20 minutes to full ingestion
- Re-embedding required: Existing chunks need re-embedding after enabling contextual retrieval
- Storage increase: Enriched text is larger than raw text (stored in embedding input, not content column)
Neutral
- Query pipeline unchanged (searches against same pgvector/BM25 indexes)
- PostgreSQL taxonomy unchanged
- Raw chunk content preserved as-is
Alternatives Considered
Alternative 1: Document Title Prepend Only
Prepend only the document title to each chunk (no LLM call).
- Pros: Zero LLM cost, simple implementation
- Cons: Misses nuanced context, no canonical questions
- Why rejected: Anthropic research shows LLM-generated context significantly outperforms simple title prepend
Alternative 2: Hierarchical Embeddings (Parent + Child)
Store embeddings at both chunk and document level, retrieve by parent then refine by child.
- Pros: Captures both granular and broad context
- Cons: Complex retrieval logic, doubles storage, harder to tune ranking
- Why rejected: Contextual embeddings achieve similar benefits with simpler architecture
References
Context Filtering and Enrichment
The contextual embedding approach implemented here addresses the broader problem of context filtering in retrieval-augmented generation — determining which retrieved information is actually useful for the generation model. Wang et al. (2023) formalise this problem in FILCO (Learning to Filter Context for Retrieval-Augmented Generation), demonstrating that context filtering using lexical overlap and conditional cross-mutual information reduces prompt lengths by up to 64% while improving answer quality across extractive QA, multi-hop reasoning, and fact verification tasks.
The ZOL system addresses the same underlying problem from the ingestion side rather than the query side: instead of filtering retrieved context at query time (FILCO), we enrich chunk context at ingestion time (contextual embeddings), ensuring that the embedding itself captures the document-level signals needed for precise retrieval. These approaches are complementary — FILCO-style query-time filtering could further refine the context assembled from contextually-enriched chunks.
Günther et al. (2024) propose Late Chunking as an alternative approach: embedding entire documents using long-context models before splitting into chunks, thereby preserving cross-sentence context in the embedding space. While theoretically appealing, Late Chunking requires models with very long context windows and introduces architectural complexity. Our contextual embedding approach achieves a similar effect with simpler infrastructure.
- Anthropic. (2024). Introducing Contextual Retrieval. Anthropic Research Blog. — 49% retrieval failure reduction with contextual embeddings + hybrid search.
- Wang, Z., Araki, J., Jiang, Z., Parvez, M. R., & Neubig, G. (2023). Learning to Filter Context for Retrieval-Augmented Generation. arXiv preprint, arXiv:2311.08377. — FILCO: context filtering reducing prompt length by up to 64%.
- Günther, M., Mohr, I., Williams, D. J., Wang, B., & Xiao, H. (2024). Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models. arXiv preprint, arXiv:2409.04701. — Alternative approach using long-context models for context-preserving embeddings.
- Vake, L., Stanny, O., & Guthrie, R. (2025). HyPE-RAG: Hypothetical Prompt Embeddings for Retrieval-Augmented Generation. SSRN Electronic Journal. — Hypothetical Prompt Embeddings for query-aligned chunk retrieval.
Related ADRs
- ADR-0048: OpenAI Embeddings Migration (current embedding model: text-embedding-3-large, 1536 dim, hosted)
- ADR-0033: BGE-M3 Embedding Migration (superseded by ADR-0048)
- ADR-0005: Original nomic-embed-text selection (superseded by ADR-0033)
- ADR-0014: LLM Entity Validation and Contextual Retrieval (initial contextual retrieval concept)
- ADR-0017: Context Retrieval Architecture (retrieval pipeline design)