
Ingestion Enrichment: Canonical Questions & Page Summaries

The retrieval-steering triad (Value Framework, Taxonomy, SNOMED) operates at query time. This page covers their ingestion-time complement: three LLM-generated artifacts that are produced at ingest, before any query arrives, so that the query-time machinery has richer material to retrieve and rank against.

All of these artifacts attack the same root problem (a 350-token chunk, severed from its parent document, loses the context needed to retrieve it well), each from a different angle:

| Artifact | What it is | The angle it attacks | Stored in | How it is used |
| --- | --- | --- | --- | --- |
| Chunk context | A 50–100 token LLM blurb situating this chunk within its parent document | The chunk doesn't say which department/page it belongs to | metadata.chunk_context (audit copy) | Baked into embedding + search_vector via enriched text; the stored copy is not read back |
| Canonical questions | 1–2 Dutch questions this chunk could answer | Users search with questions; the corpus is written as statements | metadata.canonical_questions (audit copy) | Baked into embedding + search_vector via enriched text; the stored copy is not read back |
| Page summary | A 2–3 sentence Dutch description of the whole page | Even an enriched chunk lacks the document's overall framing | metadata.page_summary | Read back at query time and prepended to the first chunk during context assembly |

The one distinction that matters

Two of the three artifacts are stored but never read back — chunk context and canonical questions influence retrieval only through the embedding and search_vector they were folded into at ingest. Their metadata copies exist purely for auditing and gap-backfill. The page summary is the only enrichment artifact the query path reads out of metadata again. Get this right and the rest of the page follows; the precise schema map is below.

Where this sits in the flow

This is the upstream companion to the query-time triad. Read it after the Core Concepts overview and before (or alongside) the Taxonomy: enrichment populates the document index; the triad steers retrieval over that index. For the full procedural pipeline see Document Ingestion Pipeline (Steps 7–8); for the decision record see ADR-0019 Contextual Embeddings.

Why ingestion-time enrichment at all?

A retrieval index is only as good as the text it indexes. Two structural mismatches degrade naive chunk indexing:

  1. The context-loss mismatch. Chunking a department page into 350-token windows produces fragments like "De raadpleging duurt gemiddeld 30 minuten. Breng uw identiteitskaart mee." that never say which department they belong to. The embedding captures only the local words, so a query about cardiology consultations may never surface the fragment. Anthropic's contextual retrieval research measured a 35–67% reduction in retrieval failure when chunks are enriched with document-level context before embedding, depending on configuration: roughly 35% from contextual embeddings alone, rising to 67% once contextual BM25 and reranking are added (see ADR-0019).

  2. The question–statement mismatch. Hospital content is written as declarations ("De dienst Cardiologie behandelt hartfalen."); users type questions ("Bij wie moet ik zijn voor hartfalen?"). In embedding space, a question and its answering statement are near but not identical. Canonical questions close this gap by generating the likely questions at index time and embedding them alongside the chunk — a form of HyDE applied at indexing rather than query time (Gao et al.'s Hypothetical Document Embeddings, inverted).

Solving these at ingestion has a decisive property: zero added query-time latency. The LLM cost is paid once, during the nightly ingest, and amortised over every subsequent query. This is the same fast-path economics the SNOMED synonym cache uses — precompute the expensive thing offline.

The two artifacts in detail

Canonical questions (HyDE-at-index-time)

For each chunk, the Tier 2 model generates 1–2 Dutch questions the chunk answers:

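Prompt: Gegeven deze tekst van een ziekenhuiswebsite, genereer 1-2 vragen (in het Nederlands) die door deze tekst beantwoord worden. (English: "Given this text from a hospital website, generate 1-2 questions (in Dutch) that are answered by this text.")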

| Chunk content | Generated canonical question |
| --- | --- |
| Visiting-hours paragraph on the cardiology page | "Wat zijn de bezoekuren op cardiologie?" |
| Dr. Van den Berg's profile | "Wie is de orthopedisch chirurg bij ZOL?" |

The theoretical basis is HyDE (Hypothetical Document Embeddings, Gao et al. 2022): instead of embedding the query directly, embed a hypothetical answer and match on that, because answer-to-answer similarity beats question-to-answer similarity. Standard HyDE does this at query time (one LLM call per query). ZOL inverts it — generating the hypothetical questions per chunk at index time — so the alignment benefit is captured with no per-query cost. See Query Enrichment for the query-side counterpart.
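The effect on the keyword side can be illustrated with a toy example. The sketch below is not BM25 and not the production code; `overlap_score`, the sample chunk, and the sample question are all hypothetical, chosen only to show why indexing a generated question next to the statement helps a question-phrased query match.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a stand-in for a real tokenizer."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_score(query: str, indexed_text: str) -> float:
    """Fraction of query tokens found in the indexed text (toy metric, not BM25)."""
    query_tokens = tokens(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokens(indexed_text)) / len(query_tokens)


# Hypothetical chunk and a canonical question an LLM might generate for it.
raw_chunk = "Bezoek is mogelijk van 14u tot 20u."
canonical_question = "Wat zijn de bezoekuren op cardiologie?"
enriched = f"{canonical_question}\n{raw_chunk}"

query = "Wat zijn de bezoekuren?"
raw_score = overlap_score(query, raw_chunk)       # statement only
enriched_score = overlap_score(query, enriched)   # question + statement
print(raw_score, enriched_score)
```

The statement alone shares no tokens with the query, while the enriched text contains every query token. A dense embedding is less brittle than exact token overlap, but the same alignment argument applies.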

Page summaries (Anthropic contextual retrieval)

During graph extraction, the LLM produces a 2–3 sentence Dutch description of the entire page:

[Pagina context: Deze pagina beschrijft de afdeling Cardiologie van ZOL, inclusief de artsen, behandelingen en consultatiemogelijkheden op campus Sint-Jan.]

This is generated for every page, including non-hub pages that write no entities to the taxonomy, so that every chunk can gain contextual grounding at query time. It is stored under metadata.page_summary (the chunk_metadata attribute on the model) and, unlike the other two artifacts, is not baked into the embedding. Instead it is prepended to the first retrieved chunk of each document during context assembly (Stage 5), exactly once per document, to avoid token-budget waste.
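The once-per-document rule can be sketched as follows. This is a minimal illustration, not the code in rag_service.py: `RetrievedChunk`, `assemble_context`, and the assumption that the stored summary is wrapped in the "[Pagina context: ...]" marker at assembly time are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    document_id: str
    content: str
    metadata: dict  # hypothetical shape; mirrors the JSONB "metadata" column


def assemble_context(chunks: list[RetrievedChunk]) -> list[str]:
    """Prepend a document's page summary to its first retrieved chunk only."""
    seen_documents: set[str] = set()
    parts: list[str] = []
    for chunk in chunks:  # assumed to arrive in final ranked order
        summary = chunk.metadata.get("page_summary")
        if summary and chunk.document_id not in seen_documents:
            parts.append(f"[Pagina context: {summary}]\n{chunk.content}")
        else:
            # Sibling chunks (or chunks without a summary) pass through unchanged.
            parts.append(chunk.content)
        seen_documents.add(chunk.document_id)
    return parts


summary = "Deze pagina beschrijft de afdeling Cardiologie van ZOL."
context = assemble_context([
    RetrievedChunk("doc-1", "Chunk A", {"page_summary": summary}),
    RetrievedChunk("doc-1", "Chunk B", {"page_summary": summary}),
    RetrievedChunk("doc-2", "Chunk C", {}),
])
```

Only "Chunk A" carries the summary; "Chunk B" from the same document is emitted bare, which is where the token-budget saving comes from.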

Chunk context vs page summary — a subtle distinction

Both situate a chunk in its document, but they differ in scope and consumption. Chunk context is per-chunk, written at ~50–100 tokens, and embedded (it changes what the vector index matches). The page summary is per-document, ~2–3 sentences, and injected into the prompt at query time (it changes what the LLM sees, not what the index matches). They are stored under distinct JSONB keys — metadata.chunk_context vs metadata.page_summary — written at different pipeline stages (chunk context during chunk processing; page summary during graph extraction). The one exception is the enrichment-backfill path (_backfill_chunk_enrichment): when it retries a chunk whose context failed to generate on the first pass, it writes the recovered per-chunk context into both chunk_context and page_summary keys. So a backfilled chunk can carry a page_summary that is actually its chunk-level context. See ADR-0014 and ADR-0019 for the lineage.

The enriched-text format: one input, two indexes

Chunk context and canonical questions are not stored as the chunk — they are concatenated into an enriched text that is fed to both the embedding model and the BM25 tsvector builder. The raw chunk is preserved verbatim in the content column and is what the LLM ultimately quotes.

enriched_text =
{chunk_context} ← situates the chunk in its document
{canonical_questions} ← the questions this chunk answers
{original_chunk_text} ← never truncated; the substantive content

The raw chunk is always included in full; only the context and questions are truncated if the combined text would exceed the embedding model's effective window (the 3,000-char cap in _build_enriched_text). Feeding both lanes from the same enriched text is what lets hybrid search benefit on the dense (vector) and sparse (BM25) sides simultaneously: a department name in the chunk context becomes both a vector signal and a keyword-searchable term.

Implementation: processing_service.py, functions _generate_chunk_contexts(), _generate_canonical_questions_batch(), and _build_enriched_text(). The embedding model is OpenAI text-embedding-3-large at 1,536 dim (ADR-0048).
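The truncation rule described above might look roughly like this. It is a sketch under stated assumptions, not the real _build_enriched_text(): the function name `build_enriched_text`, the newline separators, and the policy of trimming the context-plus-questions prefix from its tail are guesses; only the 3,000-character cap and the "never truncate the chunk" rule come from the text.

```python
MAX_ENRICHED_CHARS = 3_000  # cap named in the text; exact semantics assumed


def build_enriched_text(
    chunk_context: str,
    canonical_questions: list[str],
    chunk_text: str,
    max_chars: int = MAX_ENRICHED_CHARS,
) -> str:
    """Concatenate context, questions, and raw chunk; trim only the prefix."""
    questions = "\n".join(canonical_questions).strip()
    prefix = "\n".join(part for part in (chunk_context.strip(), questions) if part)
    # Reserve room for the full chunk plus the newline that separates it.
    budget = max_chars - len(chunk_text) - 1
    if not prefix or budget <= 0:
        # The chunk is never truncated, even if it alone exceeds the cap.
        return chunk_text
    return f"{prefix[:budget]}\n{chunk_text}"


enriched = build_enriched_text(
    "Afdeling Cardiologie, campus Sint-Jan.",
    ["Hoe lang duurt een raadpleging?"],
    "De raadpleging duurt gemiddeld 30 minuten.",
)
trimmed = build_enriched_text("c" * 5000, [], "x" * 2000)
oversized = build_enriched_text("ctx", ["q?"], "y" * 4000)
```

In the `trimmed` case the context is cut so the total lands exactly on the cap; in the `oversized` case the prefix is dropped entirely and the chunk is returned untouched.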

Exactly where each artifact is stored

Every chunk is one row in app.document_chunks (model DocumentChunk, app/models/database.py). Enrichment touches three of its columns, and only one enrichment artifact is ever read out again. The columns that matter here:

class DocumentChunk(Base):
    __tablename__ = "document_chunks"  # schema: app

    content: Mapped[str]                  # RAW chunk text, verbatim: what the LLM quotes
    embedding: Mapped[Vector]             # embedding of the ENRICHED text (not of content)
    search_vector: Mapped[TSVECTOR]       # BM25 index built from "title + ENRICHED text"
    chunk_metadata: Mapped[dict]          # JSONB; DB column literally named "metadata"
    original_content: Mapped[str | None]  # pre-PII-mask original (unrelated to enrichment)
    # Remaining columns, not touched by enrichment:
    # content_length, token_count, page_number, permission_tags, content_hash

The decisive fact: the enriched text itself is never persisted. It is an in-memory string assembled at ingest by _build_enriched_text(), handed to the embedder and the tsvector builder, and then discarded. What survives is its effect — the embedding vector and the search_vector keyword index. (Note the BM25 vector is built from to_tsvector('simple', title || ' ' || enriched) in _generate_search_vectors(), so the document title is folded into keyword search too.)
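To make the "ephemeral input, persistent effect" point concrete, here is a small sketch of what one ingest step leaves behind. `ChunkRow`, `index_chunk`, and `fake_embed` are hypothetical stand-ins (the real pipeline calls an embedding API and lets PostgreSQL build the tsvector); the sketch only shows which values derive from the enriched text and which from the raw chunk.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ChunkRow:
    """Hypothetical in-memory mirror of the persisted columns."""
    content: str               # raw chunk text
    embedding: list[float]     # vector of the enriched text
    search_vector_input: str   # text that would go to to_tsvector('simple', ...)
    metadata: dict = field(default_factory=dict)


def index_chunk(
    title: str,
    raw_chunk: str,
    chunk_context: str,
    canonical_questions: list[str],
    embed: Callable[[str], list[float]],
) -> ChunkRow:
    # Ephemeral: the enriched string exists only for the duration of this call.
    enriched = "\n".join([chunk_context, *canonical_questions, raw_chunk])
    return ChunkRow(
        content=raw_chunk,
        embedding=embed(enriched),
        search_vector_input=f"{title} {enriched}",
        metadata={
            "chunk_context": chunk_context,              # audit copy
            "canonical_questions": canonical_questions,  # audit copy
        },
    )


def fake_embed(text: str) -> list[float]:
    """Placeholder embedder so the sketch runs without an API key."""
    return [float(len(text)), float(text.count(" "))]


raw = "De raadpleging duurt gemiddeld 30 minuten."
row = index_chunk(
    "Cardiologie",
    raw,
    "Afdeling Cardiologie, campus Sint-Jan.",
    ["Hoe lang duurt een raadpleging?"],
    fake_embed,
)
```

The row keeps the raw text in `content`, while both search inputs come from the enriched string, which itself is stored nowhere.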

The full map of artifact → storage → usage:

| Artifact | Column / key | Written by | Persisted? | Read at query time? |
| --- | --- | --- | --- | --- |
| Raw chunk text | content | _store_chunks() | ✅ | ✅ Retrieved, displayed, and quoted by the synthesis LLM |
| Enriched text | (none; ephemeral) | _build_enriched_text() | ❌ | ❌ Only its derivatives below survive |
| Vector embedding | embedding | _generate_embeddings(enriched_texts) | ✅ | ✅ Dense (vector) leg of hybrid search |
| BM25 keyword index | search_vector | _generate_search_vectors(enriched_texts) | ✅ | ✅ Sparse (BM25) leg of hybrid search |
| Chunk context | metadata.chunk_context | _build_chunk_metadata() (line 436) | ✅ | ❌ Audit copy only; its retrieval effect already lives in embedding/search_vector |
| Canonical questions | metadata.canonical_questions | _build_chunk_metadata() (line 438) | ✅ | ❌ Audit copy only; same as above |
| Page summary | metadata.page_summary | graph-extraction stage (restored on re-ingest) | ✅ | ✅ Read in rag_service.py and prepended to the first chunk at context assembly |

Audit copies are not retrieval inputs

A common misreading is that the query path reads chunk_context / canonical_questions back out of metadata to influence ranking. It does not. Those keys are stored "for auditability" (see the comment at processing_service.py:155 and the _enrich_chunks_with_context() helper) and to let the gap-backfill detect which chunks are missing enrichment. Their entire ranking contribution was baked in once at embedding/indexing time. If you deleted both keys from metadata tomorrow, retrieval quality would be unchanged — but deleting page_summary would change answers, because that one is read live.

Why this split (embed-once vs. read-live)

The design follows the fast-path economics principle: anything that can be precomputed offline should be, and only what genuinely needs runtime context is read at query time.

  • Chunk context and canonical questions change what the index matches — a one-time, per-chunk property. Embedding them once and storing only the result keeps the query path free of any per-chunk metadata lookups.
  • The page summary changes what the LLM sees when composing the answer — it must be present in the prompt, so it is read live and injected exactly once per document (first retrieved chunk) to avoid token-budget waste from repeating it across sibling chunks.

How enrichment feeds the query-time triad

The pay-off is downstream. Each artifact materially improves a different query-time subsystem:

  • Canonical questions → hybrid search recall. Question-phrased user input ("Wat zijn de bezoekuren?") now matches the embedded/indexed question rather than fighting the statement–question gap. This is the single largest BM25 recall win for the most common input pattern.
  • Chunk context → retrieval precision. The department/page signal in the embedding helps the right chunks rank highly before the Value Framework even reranks them — enrichment and reranking are complementary (better candidates in, better ordering out).
  • Page summary → answer grounding. At context assembly, the once-per-document summary gives the LLM the framing to resolve "which department is this chunk about?" before generating.

Operational notes

  • Cost & latency. Enriching the full ~18,600-chunk corpus costs ~$2.50 and adds ~15–20 min to a full ingest. A 200 ms throttle between enrichment LLM calls prevents HTTP 429 (rate-limit) responses during batch runs. See Document Ingestion.
  • Gap detection & backfill. Enrichment can fail per-document on transient API errors. The pipeline detects chunks missing canonical questions / documents missing a page summary and backfills only the missing artifact, without re-chunking or re-embedding. See Document Ingestion → Enrichment Retry. This is a silent-failure-discipline guard: a chunk that silently lost its questions would degrade recall invisibly, so the gap is detected and logged.
  • Multilingual. Questions and summaries are generated in Dutch (the pilot language). A multi-language tenant generates them in the tenant's language — the same hospital-agnostic invariant the triad and SNOMED editions follow.
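The gap-detection step in the notes above could be sketched like this. It is illustrative only: `find_enrichment_gaps`, the row shape, and the rule that a document counts as "missing a page summary" when none of its chunks carries one are assumptions, not the pipeline's actual logic.

```python
def find_enrichment_gaps(chunks: list[dict]) -> dict[str, list[str]]:
    """Report chunk ids lacking canonical questions and document ids lacking a summary."""
    missing_questions: list[str] = []
    documents_in_order: list[str] = []
    documents_with_summary: set[str] = set()

    for chunk in chunks:
        meta = chunk.get("metadata") or {}
        if not meta.get("canonical_questions"):
            missing_questions.append(chunk["id"])
        doc_id = chunk["document_id"]
        if doc_id not in documents_in_order:
            documents_in_order.append(doc_id)
        if meta.get("page_summary"):
            documents_with_summary.add(doc_id)

    return {
        "chunks_missing_questions": missing_questions,
        "documents_missing_summary": [
            doc_id for doc_id in documents_in_order
            if doc_id not in documents_with_summary
        ],
    }


rows = [
    {"id": "c1", "document_id": "d1",
     "metadata": {"canonical_questions": ["Wat zijn de bezoekuren?"], "page_summary": "Samenvatting"}},
    {"id": "c2", "document_id": "d1", "metadata": {"page_summary": "Samenvatting"}},
    {"id": "c3", "document_id": "d2", "metadata": {"canonical_questions": ["Wie is de arts?"]}},
]
gaps = find_enrichment_gaps(rows)
```

Reporting the two gap types separately lets a backfill regenerate only the missing artifact for each chunk or document.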

References