Skip to main content

Context Retrieval Architecture

The ZOL Intelligent Search system uses a multi-signal retrieval pipeline that combines vector search (@karpukhin2020dpr), BM25 keyword matching (Robertson & Zaragoza 2009), knowledge-graph traversal, metadata boosting, the always-on Value Framework affinity rerank (Stage 5b), the conditional synthetic doctor-list injection (Stage 5c), and contextual enrichment into a unified context-retrieval mechanism. This page explains how each signal contributes to the final context the LLM reads.

See ADR-0017 for the architectural decision record.

Stage Numbering

This page numbers the retrieval-focused stages sequentially (1-8). For the end-to-end pipeline stage numbers (which include the semantic cache check and quality evaluation), see the Query Processing Pipeline. The mapping is: this page's Stage 1 = Pipeline Stage 2-3, Stage 2 = Pipeline Stage 5, Stage 3-4 = Pipeline Stage 5-6, and so on.

Pipeline Overview

Stage 1: Intent Classification + Query Rewriting

Corresponds to Pipeline Stages 2-3

Before any retrieval occurs, the system classifies the user's intent and rewrites follow-up queries:

  • Intent classification (Tier 2, ~400ms): Maps the query to one of 12 intent categories. Four intents (out_of_scope_medical_advice, off_topic, other_hospital, vague_input) are blocked immediately with a safety response.
  • Query rewriting (Tier 2, combined with intent classification when history exists): The LLM reformulates every query — regardless of input language — into a canonical Dutch clinical sentence, and resolves follow-up pronouns against conversation history. This maximizes both retrieval consistency and cache hit rates. The canonical templates, the three-purposes rationale, and the cache relationship are documented on the Query Rewriting page.

Strategy selection: All non-blocked intents default to the HYBRID retrieval strategy, which runs vector search and taxonomy search sequentially. The retrieval strategy enum retains VECTOR_ONLY and GRAPH_ONLY values for future use, but all safe intents currently route to HYBRID.

After intent classification, two additional pre-retrieval steps occur:

  • Taxonomy enrichment (Step 5b): resolve_search_query() maps patient-friendly Dutch terms to canonical entity names (e.g., "hartfilmpje" to "ECG") and appends resolved department/examination names to the search query.
  • Query decomposition (Step 5c, feature-flagged): For complex multi-hop questions, an LLM gate detects queries requiring multiple independent evidence chains and decomposes them into focused sub-questions. Each sub-question is retrieved in parallel, and results are merged. See the dedicated Query Decomposition page for details.

Stage 2: Three Retrieval Channels

Corresponds to Pipeline Stage 5: Sequential Hybrid Retrieval

Three retrieval channels execute sequentially (asyncpg does not support concurrent queries on the same session), each optimized for different query types:

Vector Search (Semantic Similarity)

pgvector (pgvector docs) cosine similarity search against 1,536-dimensional text-embedding-3-large embeddings (OpenAI 2024; see ADR-0048 for the migration from BGE-M3), using HNSW indexing (Malkov & Yashunin, 2018) — the dense bi-encoder retrieval pattern follows Karpukhin et al. 2020.

  • Over-fetches limit × 4 candidates (default: 28 candidates for 7 final results)
  • Minimum similarity threshold: 0.40
  • Embedding prefix convention: search_query: (queries) / search_document: (documents) is only used when EMBEDDING_PROVIDER=ollama and a Nomic model is configured. The default text-embedding-3-large (OpenAI API) uses no prefixes.

Strengths: Semantic paraphrases ("pijn in de borst" matches "thoracale klachten"), conceptual similarity.

Weaknesses: Specific medical terms and proper names may score below the similarity threshold.

BM25 Keyword Search (Exact Term Matching)

BM25-style keyword search (Robertson & Zaragoza, 2009) via PostgreSQL tsvector with GIN index, using 'simple' text configuration (no stemming):

search_vector = to_tsvector('simple', title || ' ' || content || ' ' || canonical_questions)

The 'simple' config preserves exact Dutch medical terminology — stemming would incorrectly conflate terms like "behandeling" and "behandelingen" for different medical contexts.

Strengths: Exact keyword matches ("Dr. Vanderstraeten", "cardioversie"), proper names, technical terms.

Weaknesses: Cannot detect synonyms or semantic equivalents.

Canonical Questions

Each chunk stores 1-2 pre-generated Dutch questions in chunk_metadata.canonical_questions, included in the BM25 tsvector. Generated at ingestion time by the Tier 2 model.

Example: A chunk about visiting hours generates:

  • "Wat zijn de bezoekuren van het ZOL ziekenhuis?"
  • "Wanneer kan ik op bezoek komen?"

These questions are injected into the search_vector, so a user query matching the generated question text gets a BM25 boost even if the chunk content uses different phrasing.

Canonical Questions and Vector Search

Canonical questions currently only affect BM25 keyword search. They are not embedded as separate vectors. This is a known gap — full HyPE (Hypothetical Prompt Embeddings) would store questions as additional embedding vectors for even better recall.

Taxonomy Search (Entity Relationships)

PostgreSQL taxonomy search using a tiered approach:

TierMethodUse CaseExample
Tier 1PostgreSQL taxonomy queriesStructured entity lookups"Which doctors work in Cardiology?"
Tier 1bTaxonomy alias resolutionIndirect entity matching"hartfalen" → Hartfalen → Cardiologie via HANDLES

Taxonomy alias resolution: Before graph queries execute, resolve_search_query() maps patient-friendly Dutch terms to canonical names. Example aliases:

Patient TermCanonical Form
huidartsDermatologie
suikerziekteDiabetes Mellitus
hartfilmpjeECG
oogartsOftalmologie
scanRadiologie

This ensures that queries using everyday language still find the correct graph entities.

Stage 3: Score Fusion (RRF)

Part of Pipeline Stage 5

Results from vector and BM25 are combined using Reciprocal Rank Fusion (RRF; Cormack et al., 2009) with k=60:

rrf_score = Σ 1/(k + rank_i + 1) for each result list i containing the document

Chunks found only by BM25 (not in vector results) receive similarity=0.0 and are ranked purely by their RRF score. This means BM25-only chunks compete on rank position rather than receiving an artificial similarity floor.

Graph results are merged with priority ordering: typed graph > vector > semantic graph. Typed node results receive fixed high similarity scores (0.90-0.95) reflecting their high precision for entity queries.

Stage 4: Metadata Boosting

Corresponds to Pipeline Stage 6: Metadata Boosting

Multiple multiplicative boost signals re-rank results based on contextual metadata, covering category relevance, recency, entity type alignment, conversation continuity, and more. These signals leverage enriched document metadata populated during ingestion.

For the complete metadata boosting algorithm with all signals, weights, conditions, and rationale, see Stage 6: Metadata Boosting in the Query Pipeline documentation.

Stage 5b: Value Framework Affinity Rerank

Always-on rerank step that runs after Stage 4 and before Stage 5. Each retrieved chunk is classified into one of six content categories (practical, appointments, general, legal_admin, clinical_info, regulatory) and its score is multiplied by the matching coefficient from the intent × content_category affinity matrix. Chunks are then re-sorted by the new score. The rerank prevents cross-category contamination (e.g., the wheelchair-vs-cardiology regression where high vector similarity to "rolstoel" surfaced cardiology content under a navigational intent). Telemetry is written to app.category_mismatch_telemetry per query for operator review. See Query Pipeline §Stage 5b for the algorithm and Reranking & Evaluation for the broader rerank pipeline.

Stage 5c: Synthetic Doctor-List Injection

Conditional rerank step that fires only when intent in DOCTOR_LOOKUP or DEPARTMENT_OR_SERVICE_LOOKUP AND the query contains a list-signal phrase AND a department hint can be resolved. When all three gates pass, the stage queries the taxonomy for all doctors associated with the resolved department and inserts a synthetic chunk listing them into the retrieved-chunks set before context assembly. This guarantees the LLM has the full roster when answering "alle X-ologen bij ZOL"-style questions; without it, retrieval might surface a partial list of individual doctor brochures. When any gate fails, the stage is a no-op (zero latency). See Taxonomy Query Enrichment §Stage 5c and Query Pipeline §Stage 5c.

Stage 5: Keyword Rescue

A last-resort fallback that catches specific terms (6+ characters) missed by both vector and BM25:

  1. Extract specific terms from the query (excluding Dutch stop words)
  2. Check if each term appears in any of the top-K results
  3. For missing terms, execute a direct ILIKE '%term%' search on chunk content
  4. Rescued chunks receive hardcoded scores (similarity=0.85, boosted_score=0.90)

Example: The query "Welke arts bij psoriasis?" — if "psoriasis" (9 chars) does not appear in any top result, the rescue search finds chunks containing that exact term.

Performance Note

Keyword rescue uses unindexed ILIKE which is O(N) over all chunks. At the current corpus size this is fast, but it should be revisited if the corpus grows significantly.

Stage 6: Context Assembly

Corresponds to Pipeline Stage 6b: Context Assembly

See the dedicated Context Assembly page for details. In brief:

  1. Expand: Fetch ±1 adjacent chunks per retrieved chunk
  2. Deduplicate: Strip ~70-token chunking overlaps
  3. Group by document: Merge chunks from same document into coherent blocks
  4. Token budget: Cap at 8,000 tokens, dropping lowest-relevance blocks first

Stage 7: Context Building with Page Summaries

The assembled chunks are formatted into the final context string that the LLM reads.

Page Summary Injection

Each document's page summary (generated during ingestion by the LLM Entity Validator) is prepended to the first chunk from that document:

[1] Uit cardiologie_raadpleging.html (pagina 1):
[Pagina context: Deze pagina beschrijft de afdeling Cardiologie van ZOL,
inclusief de artsen, behandelingen en consultatiemogelijkheden op campus Sint-Jan.]

De raadpleging duurt gemiddeld 30 minuten. U brengt best uw identiteitskaart
en verwijsbrief mee.

Without the page summary, the LLM would not know which department this consultation information refers to. The summary resolves this ambiguity without adding an LLM call at query time (summaries are pre-computed).

Full Contextual Retrieval (ADR-0019)

The system implements Anthropic's contextual retrieval pattern at all three levels: (1) LLM-generated chunk context is prepended before embedding for enriched vector search, (2) the enriched text is used for BM25 indexing, and (3) page summaries are prepended at generation time when building the LLM context. This three-level approach reduces retrieval failure rates by up to 67% compared to naive chunking. See ADR-0019.

Graph Context Separation

When graph results are present, they appear after a separator:

--- AANVULLENDE ZOL INFORMATIE ---
Dr. Van den Berg is verbonden aan de afdeling Cardiologie op campus Sint-Jan.
Consultatie: maandag en woensdag.

The RAG system prompt includes special instructions for handling graph context, telling the LLM to integrate this structured data with the document context.

Stage 8: LLM Generation

Corresponds to Pipeline Stage 7: Response Generation

The final messages sent to the LLM:

  1. System prompt (Dutch): Strict grounding rules, citation format [1], no medical advice
  2. Conversation history: Last 5 exchanges for multi-turn context
  3. User message: Contextdocumenten:\n\{context\}\n\nVraag: \{question\}

Model: Tier 2 (gpt-4.1 direct), fallback chain to local Ollama model. Tier 3 (gpt-5.2) is reserved for escalated search only.

Escalated Search (Think Harder)

When users signal dissatisfaction, the Think Harder flow provides enhanced retrieval:

AspectNormal PipelineEscalated
Candidates20 (full mode)100
Min similarity0.400.35
RerankerJina Reranker v2 (BGE-reranker-v2-m3 fallback)Jina Reranker v2 (BGE-reranker-v2-m3 fallback)
After rerankingTop 15 (full mode)Top 20 (configurable via rag_escalation_rerank_top_k)
LLM modelTier 2 (gpt-4.1)Escalation model (gpt-5.2)
Max tokens1,000 / 1,500 (full mode)3,000

The cross-encoder reranker jointly encodes (query, document) pairs, providing more accurate relevance scoring than the bi-encoder similarity used in normal retrieval.

How Each Enhancement Contributes

EnhancementStageImpactQuery Types Helped
Canonical questions2 (BM25)Bridges vocabulary gap"Wat zijn de bezoekuren?" → exact match
Page summaries7 (Context)Disambiguates chunks"consultatie" → which department?
Taxonomy aliases2 (Graph)Dutch term resolution"huidarts" → Dermatologie
Metadata boosting4Re-ranks by relevance signalsFollow-up queries, campus-specific queries
Content keyword boost4Promotes exact term presenceSpecific medical terms
Keyword rescue5Catches missed termsRare procedures, specific doctor names
Context assembly6Coherent document sectionsAll content queries
Query decomposition1bSplits multi-hop into sub-queries"Doctor X on campus Y doing procedure Z"
Cross-encoder rerankingEscalatedPrecision improvementComplex queries after initial failure

Data Flow: Complete Example

Query: "Welke arts behandelt epilepsie bij kinderen?"

  1. Intent: doctor_lookup (confidence: 0.92) → forces HYBRID
  2. Vector: Returns 28 candidates about neurology, pediatrics, epilepsy info
  3. BM25: Matches "epilepsie" and "kinderen" in chunk content + canonical questions
  4. Graph: Typed node query finds doctors with condition "Epilepsie" via HANDLES relationship (Department handles Condition)
  5. Score fusion: RRF (k=60); graph results merged first
  6. Metadata boost: Neurology documents get +20% category match; pediatrics docs get +10% entity type
  7. Keyword rescue: "epilepsie" (9 chars) verified present in top results — no rescue needed
  8. Context assembly: Expand ±1 for neurology page, deduplicate, group, budget to 8,000 tokens
  9. Context building: Neurology page summary prepended; graph shows "Dr. X, Neurologie, campus Sint-Jan"
  10. LLM: Generates Dutch response with [1] citations, includes graph entity data

Best Practice Alignment

The pipeline aligns with 2025-2026 production RAG best practices:

Best PracticeStatusReference
Hybrid search (vector + BM25)ImplementedADR-007
Knowledge graph for entitiesImplementedADR-006
Contextual retrievalFull (embedding + BM25 + generation-time)ADR-0019
Canonical questions / HyPEPartial (BM25 only, not embedded)ADR-007
Cross-encoder rerankingAlways-on (full mode)ADR-0024
Intent-driven strategy selectionImplementedBuilt-in
Multi-turn conversation contextImplemented (25% boost)Built-in
Token budget managementImplemented (8,000 tokens)ADR-007
Safety-first (no medical advice)Implemented (multi-layer)Built-in
Taxonomy-driven normalizationImplemented (960+ lines)ADR-0014
Multi-hop query decompositionImplemented (feature-flagged)ADR-0032
Value Framework intent-category affinity rerankImplemented (always-on, Stage 5b)migration-066
Synthetic doctor-list injectionImplemented (Stage 5c, conditional)_qs_maybe_inject_doctor_list

References

Foundational Research

Embedding Models

Retrieval Enhancement Techniques

Query Decomposition

GraphRAG and Knowledge Graph Integration

Industry References