Context Assembly

The key insight from serious RAG architecture: "What you retrieve is not what the model reads." Retrieved chunks are 350-token fragments optimised for embedding similarity — they are not optimised for LLM comprehension. The context assembly service bridges this gap by expanding, deduplicating, grouping, and budgeting the retrieved chunks into coherent reading material whose order is shaped by what we know about LLM context attention from @liu2024lostinmiddle: the highest-relevance document goes first, the lowest-relevance trailing blocks are dropped first, and critical evidence is never placed in the middle of a long context.

Liu et al. (2024) demonstrate a U-shaped performance curve: across a range of models, accuracy is highest when the relevant passage sits at the start or end of the context and degrades sharply when it sits in the middle. For long contexts, mid-position accuracy can fall below that of the same model given no retrieved context at all. Because the effect appears across several model families rather than as a quirk of any single model, we treat it as holding regardless of which LLM generates the answer. Every ordering choice below follows from it.

See ADR-007 for the architectural rationale behind this decision and the companion answer-first response policy in Prompt Engineering.

Trade-offs

| Decision | Chosen | Alternatives considered | Rejected because |
| --- | --- | --- | --- |
| Adjacent-chunk expansion | ±1 chunk per retrieved chunk, halved similarity score | No expansion; ±2 expansion; LLM-decided expansion | No expansion produces the disconnected-fragment problem the page opens with. ±2 expansion blew the token budget on long brochures. LLM-decided expansion adds a per-query LLM call to a path that doesn't otherwise need one. ±1 with halved score gives surrounding context without distorting ranking. |
| Token budget ordering | Highest-relevance document first; drop lowest-relevance trailing blocks first | Round-robin across documents; relevance-weighted token allocation | Round-robin and relevance-weighted both place medium-relevance content in the middle of the context, where @liu2024lostinmiddle shows LLM attention is weakest. Document-first ordering keeps the most-relevant evidence at the head, where attention is strongest, and drops least-relevant content at the tail when budget pressure applies. |
| Build-citations timing | After context assembly | Before context assembly | Context assembly reorders chunks (grouping by document, sorting by relevance). Building citations from pre-assembly order means the LLM's [1] reference points at the wrong source after grouping. The post-assembly invariant is enforced in _build_citations_from_chunks(). |
| Voice-channel citation source | Derive citations directly from chunks (no [N] markers in spoken answer) | Force the LLM to emit [N] markers and parse them | Spoken [N] markers are not pronounceable and break the speech model's prosody. Marker-based extraction failed silently when the LLM omitted markers. The chunk-direct path was introduced after the voice-citation regression (commits d130df74/3cd5cc2f/11a51ab2) and is described in detail in Voice Citation Pipeline. |

The Problem

Without context assembly, the LLM receives something like this:

Chunk 1 (similarity: 0.89): "...de raadpleging duurt gemiddeld 30 minuten. U brengt best uw"
Chunk 2 (similarity: 0.85): "brengt best uw identiteitskaart en verwijsbrief mee. Na de raadpleging..."
Chunk 3 (similarity: 0.82): "Cardiologie bevindt zich op campus Sint-Jan, gebouw B, verdieping 2."

Notice the problems:

  1. Chunks 1 and 2 overlap — the 70-token chunking overlap creates redundant text ("brengt best uw")
  2. Chunks are isolated — Chunk 3 is from a different part of the same document but appears disconnected
  3. Missing context — The text between chunks 2 and 3 (which might contain important information) is absent

The Solution: Five-Stage Assembly Pipeline

Stage 1: Expand

For each retrieved chunk, fetch the adjacent chunks (chunk_index ± 1) from the same document. This provides surrounding context that was lost during chunking.

Implementation: A single batched database query per document fetches all needed chunk indexes in one round-trip, minimizing latency.

Example: If chunks 5, 8, and 12 were retrieved from document A, the expansion fetches chunks 4, 5, 6, 7, 8, 9, 11, 12, and 13.

Adjacent (expansion) chunks receive a halved similarity score (original_score × 0.5) to maintain ranking integrity — they are contextually relevant but were not directly retrieved.
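The expansion rule can be sketched in a few lines of Python. This is an illustrative sketch, not the service's actual code: the function name `plan_expansion`, the `(doc_id, chunk_index, score)` tuple shape, and the zero-based chunk index are assumptions made for the example. It returns, per document, the chunk indexes to fetch (one batched lookup per document) together with the score each chunk would carry.

```python
from collections import defaultdict


def plan_expansion(retrieved, window=1, penalty=0.5):
    """Return {doc_id: {chunk_index: score}} for retrieved chunks plus neighbours.

    `retrieved` is a list of (doc_id, chunk_index, score) tuples (hypothetical shape).
    """
    scores = defaultdict(dict)
    # Directly retrieved chunks keep their original similarity score.
    for doc_id, index, score in retrieved:
        scores[doc_id][index] = score
    direct = {(doc_id, index) for doc_id, index, _ in retrieved}

    for doc_id, index, score in retrieved:
        for neighbour in range(index - window, index + window + 1):
            # Skip indexes before the document start (assumes zero-based indexes)
            # and chunks that were retrieved directly.
            if neighbour < 0 or (doc_id, neighbour) in direct:
                continue
            # Expansion chunks get the penalised score; if two retrieved chunks
            # share a neighbour, keep the higher value (an assumption).
            previous = scores[doc_id].get(neighbour, 0.0)
            scores[doc_id][neighbour] = max(previous, score * penalty)

    # Sort by chunk index so each document's key set maps to one batched query.
    return {doc: dict(sorted(by_index.items())) for doc, by_index in scores.items()}


plan = plan_expansion([("A", 5, 0.9), ("A", 8, 0.8), ("A", 12, 0.6)])
print(list(plan["A"]))  # [4, 5, 6, 7, 8, 9, 11, 12, 13]
```

With the indexes from the example above (5, 8, and 12), the sketch produces the same nine chunk indexes, with the six expansion chunks carrying half the score of the retrieved chunk they sit next to.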

Stage 2: Deduplicate

Our chunking uses a 70-token overlap to preserve context across chunk boundaries. When consecutive chunks from the same document are included (which happens frequently after expansion), this overlap creates redundant text.

The deduplication step compares the tail of chunk N with the head of chunk N+1. When an overlap exceeding 20 characters is detected, the duplicate portion is stripped from chunk N+1.

Example:

Before: "...U brengt best uw identiteitskaart en verwijsbrief mee."
"U brengt best uw identiteitskaart en verwijsbrief mee. Na de raadpleging..."

After: "...U brengt best uw identiteitskaart en verwijsbrief mee."
" Na de raadpleging..."
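A minimal sketch of the tail/head comparison, assuming a plain character-level match: the function name `strip_overlap` and the brute-force longest-match search are illustrative choices, not a description of the production implementation.

```python
def strip_overlap(prev_text, next_text, min_overlap=20):
    """Remove from next_text the longest prefix that repeats the tail of prev_text.

    Only overlaps longer than `min_overlap` characters are stripped, so short
    coincidental matches (a shared word or two) are left alone.
    """
    longest_possible = min(len(prev_text), len(next_text))
    # Try the longest candidate first; stop before reaching min_overlap.
    for size in range(longest_possible, min_overlap, -1):
        if prev_text.endswith(next_text[:size]):
            return next_text[size:]
    return next_text


chunk_n = "...U brengt best uw identiteitskaart en verwijsbrief mee."
chunk_n1 = "U brengt best uw identiteitskaart en verwijsbrief mee. Na de raadpleging..."
print(repr(strip_overlap(chunk_n, chunk_n1)))  # ' Na de raadpleging...'
```

The search is quadratic in the worst case, which is acceptable for fragments of a few hundred tokens; a production version might bound the comparison window to the known overlap size.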

Stage 3: Group by Document

Chunks are sorted so that all chunks from the same document appear together, ordered by chunk_index. This transforms scattered fragments into coherent document sections.

Cross-document ordering: Document groups are sorted by the highest relevance score within each group (descending). The most relevant document appears first.
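The two sort keys can be illustrated as follows. The `Chunk` dataclass and the `group_by_document` name are hypothetical stand-ins for whatever the service uses internally; the sketch only shows the ordering rules described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    # Hypothetical minimal chunk record for this example.
    doc_id: str
    chunk_index: int
    score: float


def group_by_document(chunks):
    """Return document blocks: most relevant document first, chunks in reading order."""
    groups = {}
    for chunk in chunks:
        groups.setdefault(chunk.doc_id, []).append(chunk)
    # Within a document, restore the original reading order.
    blocks = [sorted(group, key=lambda c: c.chunk_index) for group in groups.values()]
    # Across documents, rank each block by its best-scoring chunk, descending.
    blocks.sort(key=lambda block: max(c.score for c in block), reverse=True)
    return blocks


chunks = [Chunk("B", 2, 0.85), Chunk("A", 6, 0.445), Chunk("C", 1, 0.82), Chunk("A", 5, 0.89)]
for block in group_by_document(chunks):
    print([(c.doc_id, c.chunk_index) for c in block])
# [('A', 5), ('A', 6)]
# [('B', 2)]
# [('C', 1)]
```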

Stage 4: Enforce Token Budget

The assembled context is capped at a configurable token limit (default: 8,000 tokens, measured with tiktoken cl100k_base). When the budget is exceeded, entire document blocks are dropped from the tail (lowest relevance) until the context fits within budget.

This approach preserves document coherence (it is better to include 2 complete document sections than 3 truncated ones) and is informed by @liu2024lostinmiddle: dropping whole blocks from the tail removes the least-relevant content first while keeping the highest-relevance content in the high-attention positions of the context.
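A sketch of tail-first block dropping, under stated assumptions: `enforce_token_budget` is a hypothetical name, each block is a list of chunk texts ordered most relevant first, and the default whitespace counter is only a stand-in so the example runs without dependencies. The pipeline itself measures tokens with tiktoken's cl100k_base encoding.

```python
def enforce_token_budget(blocks, max_tokens=8000, count_tokens=None):
    """Drop whole document blocks from the tail until the context fits the budget.

    `blocks` is a list of document blocks (each a list of chunk texts),
    ordered from most to least relevant.
    """
    if count_tokens is None:
        # Stand-in counter for the example. A real counter could be, for instance:
        #   enc = tiktoken.get_encoding("cl100k_base")
        #   count_tokens = lambda text: len(enc.encode(text))
        count_tokens = lambda text: len(text.split())

    costs = [sum(count_tokens(text) for text in block) for block in blocks]
    total = sum(costs)
    kept = len(blocks)
    # Remove the least relevant (last) block until the total fits.
    while kept > 0 and total > max_tokens:
        kept -= 1
        total -= costs[kept]
    return blocks[:kept]


blocks = [["a b c", "d e"], ["f g h i"], ["j k"]]
print(len(enforce_token_budget(blocks, max_tokens=9)))  # 2
```

Note that in this sketch a single oversized first block results in an empty context; whether the real service falls back to truncation in that case is not specified here.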

Stage 5: Contextual Retrieval (Page Summaries)

Note: Stage 5 is performed by GraphQueryService.build_context(), not by ContextAssemblyService. It is included here for completeness as a logical part of the assembly pipeline.

After token budgeting, the context builder applies contextual retrieval, a technique based on Anthropic's research (Anthropic, 2024) that prepends document-level context to individual chunks.

The Problem Contextual Retrieval Solves

Individual chunks are 350-token fragments that have lost their document-level context. Consider this chunk:

"De raadpleging duurt gemiddeld 30 minuten. U brengt best uw identiteitskaart en verwijsbrief mee."

Without context, neither the embedding model nor the LLM knows which department this refers to. Is it Cardiology? Orthopedics? Ophthalmology? The page summary resolves this ambiguity:

[Pagina context: Deze pagina beschrijft de afdeling Cardiologie van ZOL, inclusief de artsen, behandelingen en consultatiemogelijkheden op campus Sint-Jan.]

How It Works

Page summaries are pre-computed during document ingestion by the LLM Entity Validator (ADR-0013). They are stored in the chunk_metadata.page_summary JSONB field — no database migration required. For the conceptual treatment of page summaries alongside canonical questions — and how both ingestion-time artifacts feed the query-time retrieval-steering triad — see Ingestion Enrichment.

At query time, the context builder:

  1. Tracks which documents have already had their summary injected (seen_doc_summaries set)
  2. For the first chunk from each document, prepends [Pagina context: {summary}]
  3. Subsequent chunks from the same document receive no prefix (avoids redundancy)

This ensures each document's summary appears exactly once in the LLM context, providing maximum context with minimum token overhead.
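The three steps above can be sketched as follows. Only the `seen_doc_summaries` set and the `[Pagina context: ...]` prefix come from the description; the function name `render_context`, the tuple shape, and the blank-line separator between chunks are assumptions for illustration and do not reflect the actual signature of GraphQueryService.build_context().

```python
def render_context(chunks):
    """Join chunk texts, prefixing each document's summary exactly once.

    `chunks` is an iterable of (doc_id, text, page_summary) tuples in
    post-assembly order; page_summary may be None (hypothetical shape).
    """
    seen_doc_summaries = set()
    parts = []
    for doc_id, text, page_summary in chunks:
        if page_summary and doc_id not in seen_doc_summaries:
            # First chunk of this document that carries a summary: inject it.
            parts.append(f"[Pagina context: {page_summary}]\n{text}")
            seen_doc_summaries.add(doc_id)
        else:
            # Later chunks from the same document get no prefix.
            parts.append(text)
    return "\n\n".join(parts)


context = render_context([
    ("A", "De raadpleging duurt gemiddeld 30 minuten.", "Afdeling Cardiologie van ZOL."),
    ("A", "Na de raadpleging...", "Afdeling Cardiologie van ZOL."),
    ("B", "Parkeren kan op campus Sint-Jan.", "Praktische informatie."),
])
print(context.count("[Pagina context:"))  # 2
```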

Impact on Retrieval Quality

Anthropic reports that contextual retrieval reduces the top-20 retrieval failure rate by 49% when contextual embeddings are combined with contextual BM25 (Anthropic, 2024). Since the ZOL system already employs hybrid search (vector + BM25, ADR-0017), page summaries are the natural next step to maximize retrieval relevance.

The summaries particularly help with:

  • Disambiguation: Chunks about "consultatie-uren" are correctly linked to their department
  • Cross-document coherence: The LLM understands the broader topic of each source document
  • Dutch language context: Summaries are generated in Dutch, matching the query language

Citation Alignment

A critical invariant: citations must be built after context assembly, not before. The assembly pipeline reorders chunks (grouping by document, sorting by relevance), which means the chunk at position 1 after assembly may not be the same chunk that was at position 1 before assembly. Since the LLM sees chunks in the post-assembly order and generates [1], [2] references accordingly, the citation list must match that same order.

Before assembly: [chunk-A-5, chunk-B-2, chunk-A-6, chunk-C-1]
After assembly: [chunk-A-5, chunk-A-6, chunk-B-2, chunk-C-1] (grouped by document)
Citations built: [1] = chunk-A source, [2] = chunk-B source, [3] = chunk-C source

Building citations before assembly would cause citation [1] to point to the wrong source document when the LLM references [1] in its response.
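To make the invariant concrete, here is a small sketch that numbers source documents in the order the LLM encounters them. It is not the body of _build_citations_from_chunks(); the `build_citations` name and the `(doc_id, chunk_index)` tuples are hypothetical and chosen only to mirror the example above.

```python
def build_citations(assembled_chunks):
    """Map each source document to its citation number, in first-seen order.

    `assembled_chunks` must be the post-assembly sequence of
    (doc_id, chunk_index) tuples, i.e. the order the LLM actually reads.
    """
    citations = {}
    for doc_id, _chunk_index in assembled_chunks:
        if doc_id not in citations:
            citations[doc_id] = len(citations) + 1
    return citations


after_assembly = [("A", 5), ("A", 6), ("B", 2), ("C", 1)]
print(build_citations(after_assembly))  # {'A': 1, 'B': 2, 'C': 3}

# Feeding a sequence in a different order yields different numbers, which is
# why the input must be the same sequence that was rendered into the prompt.
print(build_citations([("B", 2), ("A", 5), ("A", 6)]))  # {'B': 1, 'A': 2}
```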

Taxonomy Injection Gate Cross-Reference

Taxonomy results from Stage 4b only reach the assembled context when the Taxonomy Injection Gate passes. The gate uses intent and similarity signals to suppress taxonomy injection when vector results are sufficient (the "default suppress" rule). When taxonomy is injected, it is prepended to the assembled context behind the --- AANVULLENDE ZOL INFORMATIE --- separator described in Context Retrieval.

Voice Channel Note

For the voice channel, citations are sourced from the retrieved chunks directly rather than extracted from [N] markers in the answer text (the LLM is instructed to omit numeric markers in spoken answers — they are un-speakable). The chunk-direct fallback path was introduced in commits d130df74 / 3cd5cc2f / 11a51ab2 after a regression in the voice citation pipeline; see Voice Citation Pipeline for the cascade fix and cache-flush discipline. The text channel (chat) retains the marker-based path described above.

Configuration

| Setting | Default | Range | Description |
| --- | --- | --- | --- |
| context_assembly_enabled | true | boolean | Enable/disable the entire pipeline |
| context_assembly_expand_window | 1 | 0-3 | Adjacent chunks to fetch in each direction |
| context_assembly_max_tokens | 8000 | 500-16000 | Maximum tokens in assembled context |
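One way to keep these ranges enforced in code is a small validated settings object. The sketch below is an assumption about how such validation could look; the class name `ContextAssemblyConfig` is hypothetical, and only the setting names, defaults, and ranges are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextAssemblyConfig:
    # Defaults mirror the documented settings.
    context_assembly_enabled: bool = True
    context_assembly_expand_window: int = 1
    context_assembly_max_tokens: int = 8000

    def __post_init__(self):
        # Reject values outside the documented ranges at construction time.
        if not 0 <= self.context_assembly_expand_window <= 3:
            raise ValueError("context_assembly_expand_window must be between 0 and 3")
        if not 500 <= self.context_assembly_max_tokens <= 16000:
            raise ValueError("context_assembly_max_tokens must be between 500 and 16000")


config = ContextAssemblyConfig(context_assembly_max_tokens=4000)
print(config.context_assembly_expand_window)  # 1
```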

Debug Panel Visibility

When showDebugInfo is enabled in the frontend, the pipeline progress displays:

  • Search method badge (orange): Shows hybrid_bm25_vector or vector_only
  • BM25 matches badge (red): Number of BM25 keyword matches found
  • Context expansion badge (cyan): Shows the chunk count transformation, e.g., "7 → 12 chunks"

Performance Impact

Context assembly adds approximately 50-65ms to the query pipeline:

| Operation | Typical Latency |
| --- | --- |
| Expansion (DB queries) | ~40ms |
| Deduplication | ~5ms |
| Grouping | ~2ms |
| Token counting + budget | ~15ms |
| Total | ~62ms |

This is negligible compared to the ~3,000ms response generation stage, and the improvement in answer quality justifies the cost.

References