Ingestion Enrichment: Canonical Questions & Page Summaries
The retrieval-steering triad (Value Framework, Taxonomy, SNOMED) operates at query time. This page covers its ingestion-time complement: three LLM-generated artifacts that are produced during ingest, before any query arrives, so that the query-time machinery has richer material to retrieve and rank against.
All three artifacts attack the same root problem (a 350-token chunk, severed from its parent document, loses the context needed to retrieve it well), each from a different angle:
| Artifact | What it is | The angle it attacks | Stored in | How it is used |
|---|---|---|---|---|
| Chunk context | A 50–100 token LLM blurb situating this chunk within its parent document | The chunk doesn't say which department/page it belongs to | metadata.chunk_context (audit copy) | Baked into embedding + search_vector via enriched text; the stored copy is not read back |
| Canonical questions | 1–2 Dutch questions this chunk could answer | Users search with questions; the corpus is written as statements | metadata.canonical_questions (audit copy) | Baked into embedding + search_vector via enriched text; the stored copy is not read back |
| Page summary | A 2–3 sentence Dutch description of the whole page | Even an enriched chunk lacks the document's overall framing | metadata.page_summary | Read back at query time and prepended to the first chunk during context assembly |
Two of the three artifacts are stored but never read back — chunk context and canonical questions influence retrieval only through the embedding and search_vector they were folded into at ingest. Their metadata copies exist purely for auditing and gap-backfill. The page summary is the only enrichment artifact the query path reads out of metadata again. Get this right and the rest of the page follows; the precise schema map is below.
This is the upstream companion to the query-time triad. Read it after the Core Concepts overview and before (or alongside) the Taxonomy: enrichment populates the document index; the triad steers retrieval over that index. For the full procedural pipeline see Document Ingestion Pipeline (Steps 7–8); for the decision record see ADR-0019 Contextual Embeddings.
Why ingestion-time enrichment at all?
A retrieval index is only as good as the text it indexes. Two structural mismatches degrade naive chunk indexing:
- The context-loss mismatch. Chunking a department page into 350-token windows produces fragments like "De raadpleging duurt gemiddeld 30 minuten. Breng uw identiteitskaart mee." Which department is that? The embedding captures only the local words, so a query about cardiology consultations may never surface it. Anthropic's contextual retrieval research measured a 35–67 % reduction in retrieval failure when chunks are enriched with document-level context before embedding (see ADR-0019).
- The question–statement mismatch. Hospital content is written as declarations ("De dienst Cardiologie behandelt hartfalen."); users type questions ("Bij wie moet ik zijn voor hartfalen?"). In embedding space, a question and its answering statement are near but not identical. Canonical questions close this gap by generating the likely questions at index time and embedding them alongside the chunk: a form of HyDE (Gao et al.'s Hypothetical Document Embeddings) inverted to run at indexing rather than query time.
Solving these at ingestion has a decisive property: zero added query-time latency. The LLM cost is paid once, during the nightly ingest, and amortised over every subsequent query. This is the same fast-path economics the SNOMED synonym cache uses — precompute the expensive thing offline.
The artifacts in detail
Canonical questions (HyDE-at-index-time)
For each chunk, the Tier 2 model generates 1–2 Dutch questions the chunk answers:
Prompt: Gegeven deze tekst van een ziekenhuiswebsite, genereer 1-2 vragen (in het Nederlands) die door deze tekst beantwoord worden.
| Chunk content | Generated canonical question |
|---|---|
| Visiting-hours paragraph on the cardiology page | "Wat zijn de bezoekuren op cardiologie?" |
| Dr. Van den Berg's profile | "Wie is de orthopedisch chirurg bij ZOL?" |
The theoretical basis is HyDE (Hypothetical Document Embeddings, Gao et al. 2022): instead of embedding the query directly, embed a hypothetical answer and match on that, because answer-to-answer similarity beats question-to-answer similarity. Standard HyDE does this at query time (one LLM call per query). ZOL inverts it — generating the hypothetical questions per chunk at index time — so the alignment benefit is captured with no per-query cost. See Query Enrichment for the query-side counterpart.
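The per-chunk generation step can be sketched roughly as follows. This is a minimal illustration, not the code in processing_service.py: the `llm` callable, the function name, and the line-based parsing of the completion are all assumptions made for the example; only the Dutch prompt text and the 1–2 question limit come from this page.

```python
import re
from typing import Callable

PROMPT = (
    "Gegeven deze tekst van een ziekenhuiswebsite, genereer 1-2 vragen "
    "(in het Nederlands) die door deze tekst beantwoord worden."
)

# Matches a leading bullet ("-", "*") or numbering ("1.", "2)") on a line.
_LIST_MARKER = re.compile(r"^\s*(?:[-*]|\d+[.)])\s*")


def generate_canonical_questions(
    chunk_text: str,
    llm: Callable[[str], str],  # hypothetical: takes a prompt, returns raw text
    max_questions: int = 2,
) -> list[str]:
    """Ask the model for questions and keep at most `max_questions` of them."""
    completion = llm(f"{PROMPT}\n\n{chunk_text}")
    questions: list[str] = []
    for line in completion.splitlines():
        cleaned = _LIST_MARKER.sub("", line).strip()
        # Assumption: anything that does not end in "?" is model chatter.
        if cleaned.endswith("?"):
            questions.append(cleaned)
    return questions[:max_questions]
```

Capping the result in code rather than trusting the prompt keeps the enriched text bounded even when the model returns more questions than requested.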
Page summaries (Anthropic contextual retrieval)
During graph extraction, the LLM produces a 2–3 sentence Dutch description of the entire page:
[Pagina context: Deze pagina beschrijft de afdeling Cardiologie van ZOL, inclusief de artsen, behandelingen en consultatiemogelijkheden op campus Sint-Jan.]
This is generated for every page, including non-hub pages that write no entities to the taxonomy, so that all chunks gain contextual grounding at query time. It is stored in metadata.page_summary (the chunk_metadata attribute on the model) and, unlike the other two artifacts, is not baked into the embedding. Instead it is prepended to the first retrieved chunk of each document during context assembly (Stage 5), exactly once per document, to avoid token-budget waste.
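The once-per-document injection might look like the sketch below. It assumes retrieved chunks arrive as dictionaries in final ranked order with hypothetical `document_id`, `content`, and `metadata` keys, and that the `[Pagina context: ...]` wrapper is applied at injection time; the real Stage 5 code in rag_service.py may differ on all of these points.

```python
def assemble_context(chunks: list[dict]) -> list[str]:
    """Prepend each document's page summary to its first retrieved chunk only."""
    seen_documents: set[str] = set()
    parts: list[str] = []
    for chunk in chunks:  # assumed to be in final ranked order
        text = chunk["content"]
        doc_id = chunk["document_id"]
        if doc_id not in seen_documents:
            seen_documents.add(doc_id)
            summary = (chunk.get("metadata") or {}).get("page_summary")
            if summary:
                # Only the first chunk of a document carries the summary.
                text = f"[Pagina context: {summary}]\n{text}"
        parts.append(text)
    return parts
```

Sibling chunks from the same page pass through unchanged, which is where the token saving comes from.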
Chunk context and the page summary both situate a chunk in its document, but they differ in scope and consumption. Chunk context is per-chunk, roughly 50–100 tokens, and embedded (it changes what the vector index matches). The page summary is per-document, 2–3 sentences, and injected into the prompt at query time (it changes what the LLM sees, not what the index matches). They are stored under distinct JSONB keys, metadata.chunk_context and metadata.page_summary, and written at different pipeline stages (chunk context during chunk processing; page summary during graph extraction). The one exception is the enrichment-backfill path (_backfill_chunk_enrichment): when it retries a chunk whose context failed to generate on the first pass, it writes the recovered per-chunk context into both the chunk_context and page_summary keys. A backfilled chunk can therefore carry a page_summary that is actually its chunk-level context. See ADR-0014 and ADR-0019 for the lineage.
The enriched-text format: one input, two indexes
Chunk context and canonical questions do not replace the stored chunk text. They are concatenated with it into an enriched text that is fed to both the embedding model and the BM25 tsvector builder. The raw chunk is preserved verbatim in the content column and is what the LLM ultimately quotes.
enriched_text =
    {chunk_context}        ← situates the chunk in its document
    {canonical_questions}  ← the questions this chunk answers
    {original_chunk_text}  ← never truncated; the substantive content
The raw chunk is always included in full; only the context and questions are truncated if the combined text would exceed the embedding model's effective window (the 3,000-char cap in _build_enriched_text). Building both search lanes from the same enriched text is what lets hybrid search benefit on the dense (vector) and sparse (BM25) sides simultaneously: a department name in the chunk context becomes both a vector signal and a keyword-searchable term.
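A minimal sketch of that assembly rule is shown below. It is not the actual _build_enriched_text(): the newline separators, the choice to cut the prefix from its tail (so questions are trimmed before context), and the behaviour when the raw chunk alone exceeds the cap are assumptions for illustration. Only the 3,000-character cap and the "raw chunk is never truncated" rule come from the text above.

```python
from typing import Sequence

MAX_ENRICHED_CHARS = 3000  # the cap named in the text


def build_enriched_text(
    chunk_text: str,
    chunk_context: str = "",
    canonical_questions: Sequence[str] = (),
    max_chars: int = MAX_ENRICHED_CHARS,
) -> str:
    """Concatenate context, questions, and raw chunk; truncate only the prefix."""
    prefix_parts = [chunk_context.strip(), "\n".join(canonical_questions).strip()]
    prefix = "\n".join(part for part in prefix_parts if part)
    if not prefix:
        return chunk_text
    # Room left for the prefix after reserving the full chunk plus one newline.
    budget = max_chars - len(chunk_text) - 1
    if budget <= 0:
        # Assumption: with no room left, the prefix is dropped entirely.
        return chunk_text
    return f"{prefix[:budget]}\n{chunk_text}"
```

Whatever the real truncation order is, the invariant worth testing is that the result always ends with the untouched raw chunk.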
Implementation: processing_service.py — _generate_chunk_contexts(), _generate_canonical_questions_batch(), _build_enriched_text(). The embedding model is OpenAI text-embedding-3-large at 1,536 dim (ADR-0048).
Exactly where each artifact is stored
Every chunk is one row in app.document_chunks (model DocumentChunk, app/models/database.py). Enrichment touches three of its columns, and only one enrichment artifact is ever read out again. The columns that matter here:
class DocumentChunk(Base):
    __tablename__ = "document_chunks"  # schema: app
    content: Mapped[str]  # RAW chunk text, verbatim; this is what the LLM quotes
    embedding: Mapped[Vector]  # embedding of the ENRICHED text (not of content)
    search_vector: Mapped[TSVECTOR]  # BM25 index built from "title + ENRICHED text"
    chunk_metadata: Mapped[dict]  # JSONB; the DB column is literally named "metadata"
    original_content: Mapped[str | None]  # pre-PII-mask original (unrelated to enrichment)
    # Other columns (omitted here): content_length, token_count, page_number,
    # permission_tags, content_hash
The decisive fact: the enriched text itself is never persisted. It is an in-memory string assembled at ingest by _build_enriched_text(), handed to the embedder and the tsvector builder, and then discarded. What survives is its effect — the embedding vector and the search_vector keyword index. (Note the BM25 vector is built from to_tsvector('simple', title || ' ' || enriched) in _generate_search_vectors(), so the document title is folded into keyword search too.)
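To make the "ephemeral enriched text" point concrete, the sketch below builds one row the way this section describes. The function name, the row keys, and the injected `embed` and `to_tsvector` callables are hypothetical stand-ins (in the real pipeline the tsvector is computed by PostgreSQL and the embedding by the embedding API); the sketch also skips the length cap for brevity.

```python
from typing import Callable, Sequence


def build_chunk_row(
    title: str,
    raw_text: str,
    chunk_context: str,
    canonical_questions: Sequence[str],
    embed: Callable[[str], list[float]],   # hypothetical embedding call
    to_tsvector: Callable[[str], object],  # hypothetical keyword indexer
) -> dict:
    """Derive both indexes from one enriched string, then let it go."""
    enriched = "\n".join([chunk_context, *canonical_questions, raw_text])
    return {
        "content": raw_text,                 # raw text only, never the enriched text
        "embedding": embed(enriched),        # dense leg sees the enriched text
        "search_vector": to_tsvector(f"{title} {enriched}"),  # sparse leg, title folded in
        "metadata": {                        # audit copies, not read at query time
            "chunk_context": chunk_context,
            "canonical_questions": list(canonical_questions),
        },
    }  # `enriched` goes out of scope here; only its derivatives are kept
```

The row has no key holding the enriched string itself, which mirrors the storage map: the effect survives in the two indexes, and the inputs survive only as audit copies.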
The full map of artifact → storage → usage:
| Artifact | Column / key | Written by | Persisted? | Read at query time? |
|---|---|---|---|---|
| Raw chunk text | content | _store_chunks() | ✅ | ✅ — retrieved, displayed, and quoted by the synthesis LLM |
| Enriched text | (none — ephemeral) | _build_enriched_text() | ❌ | — only its derivatives below survive |
| Vector embedding | embedding | _generate_embeddings(enriched_texts) | ✅ | ✅ — dense (vector) leg of hybrid search |
| BM25 keyword index | search_vector | _generate_search_vectors(enriched_texts) | ✅ | ✅ — sparse (BM25) leg of hybrid search |
| Chunk context | metadata.chunk_context | _build_chunk_metadata() (line 436) | ✅ | ❌ — audit copy only; its retrieval effect already lives in embedding/search_vector |
| Canonical questions | metadata.canonical_questions | _build_chunk_metadata() (line 438) | ✅ | ❌ — audit copy only; same as above |
| Page summary | metadata.page_summary | graph-extraction stage (restored on re-ingest) | ✅ | ✅ — read in rag_service.py and prepended to the first chunk at context assembly |
A common misreading is that the query path reads chunk_context / canonical_questions back out of metadata to influence ranking. It does not. Those keys are stored "for auditability" (see the comment at processing_service.py:155 and the _enrich_chunks_with_context() helper) and to let the gap-backfill detect which chunks are missing enrichment. Their entire ranking contribution was baked in once at embedding/indexing time. If you deleted both keys from metadata tomorrow, retrieval quality would be unchanged — but deleting page_summary would change answers, because that one is read live.
Why this split (embed-once vs. read-live)
The design follows the fast-path economics principle: anything that can be precomputed offline should be, and only what genuinely needs runtime context is read at query time.
- Chunk context and canonical questions change what the index matches — a one-time, per-chunk property. Embedding them once and storing only the result keeps the query path free of any per-chunk metadata lookups.
- The page summary changes what the LLM sees when composing the answer — it must be present in the prompt, so it is read live and injected exactly once per document (first retrieved chunk) to avoid token-budget waste from repeating it across sibling chunks.
How enrichment feeds the query-time triad
The pay-off is downstream. Each artifact materially improves a different query-time subsystem:
- Canonical questions → hybrid search recall. Question-phrased user input ("Wat zijn de bezoekuren?") now matches the embedded/indexed question rather than fighting the statement–question gap. This is the single largest BM25 recall win for the most common input pattern.
- Chunk context → retrieval precision. The department/page signal in the embedding helps the right chunks rank highly before the Value Framework even reranks them — enrichment and reranking are complementary (better candidates in, better ordering out).
- Page summary → answer grounding. At context assembly, the once-per-document summary gives the LLM the framing to resolve "which department is this chunk about?" before generating.
Operational notes
- Cost & latency. ~$2.50 to enrich the full ~18,600-chunk corpus; adds ~15–20 min to a full ingest. A 200 ms throttle between enrichment LLM calls prevents 429s during batch runs. See Document Ingestion.
- Gap detection & backfill. Enrichment can fail per-document on transient API errors. The pipeline detects chunks missing canonical questions / documents missing a page summary and backfills only the missing artifact, without re-chunking or re-embedding. See Document Ingestion → Enrichment Retry. This is a silent-failure-discipline guard: a chunk that silently lost its questions would degrade recall invisibly, so the gap is detected and logged.
- Multilingual. Questions and summaries are generated in Dutch (the pilot language). A multi-language tenant generates them in the tenant's language — the same hospital-agnostic invariant the triad and SNOMED editions follow.
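The gap-detection rule described under "Gap detection & backfill" can be illustrated with a small in-memory sketch. The row shape (`id`, `document_id`, `metadata`) and the function name are assumptions for the example; the real pipeline presumably runs an equivalent check against app.document_chunks.

```python
def find_enrichment_gaps(chunks: list[dict]) -> dict[str, list[str]]:
    """Report chunks without canonical questions and documents without a summary."""
    missing_questions: list[str] = []
    documents: dict[str, bool] = {}  # document_id -> has at least one page_summary
    for chunk in chunks:
        metadata = chunk.get("metadata") or {}
        doc_id = chunk["document_id"]
        documents.setdefault(doc_id, False)
        # An absent key and an empty list both count as a gap.
        if not metadata.get("canonical_questions"):
            missing_questions.append(chunk["id"])
        if metadata.get("page_summary"):
            documents[doc_id] = True
    return {
        "chunks_missing_questions": missing_questions,
        "documents_missing_summary": [
            doc_id for doc_id, has_summary in documents.items() if not has_summary
        ],
    }
```

Reporting the two gap types separately is what allows a backfill to regenerate only the missing artifact.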
References
- Anthropic. (2024). Introducing Contextual Retrieval. — 35–67 % retrieval-failure reduction; the basis for chunk context + page summaries.
- Gao, L., Ma, X., Lin, J., & Callan, J. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). — Hypothetical Document Embeddings; canonical questions invert this to index time.
- Wang, Z., et al. (2023). Learning to Filter Context for RAG (FILCO). — The query-side counterpart to ingestion-time enrichment (see ADR-0019).
- ADR-0019: Contextual Embeddings — the decision record and enriched-text format.
- Document Ingestion Pipeline — Steps 7–8, the procedural detail.
- Context Assembly — Stage 5 — query-time page-summary injection.
- Core Concepts overview — the query-time retrieval-steering triad this enrichment feeds.