Context Retrieval Architecture

The ZOL Intelligent Search system uses a multi-signal retrieval pipeline that combines vector search (@karpukhin2020dpr), BM25 keyword matching (Robertson & Zaragoza 2009), knowledge-graph traversal, metadata boosting, the always-on Value Framework affinity rerank (Stage 5b), the conditional synthetic doctor-list injection (Stage 5c), and contextual enrichment into a unified context-retrieval mechanism. This page explains how each signal contributes to the final context the LLM reads.

See ADR-0017 for the architectural decision record.

Stage Numbering

This page numbers the retrieval-focused stages sequentially (1-8). For the end-to-end pipeline stage numbers (which include the semantic cache check and quality evaluation), see the Query Processing Pipeline. The mapping is: this page's Stage 1 = Pipeline Stage 2-3, Stage 2 = Pipeline Stage 5, Stage 3-4 = Pipeline Stage 5-6, and so on.

Pipeline Overview

Stage 1: Intent Classification + Query Rewriting

Corresponds to Pipeline Stages 2-3

Before any retrieval occurs, the system classifies the user's intent and rewrites follow-up queries:

Intent classification (Tier 2, ~400ms): Maps the query to one of 12 intent categories. Four intents (out_of_scope_medical_advice, off_topic, other_hospital, vague_input) are blocked immediately with a safety response.
Query rewriting (Tier 2, combined with intent classification when history exists): The LLM reformulates every query — regardless of input language — into a canonical Dutch clinical sentence, and resolves follow-up pronouns against conversation history. This maximizes both retrieval consistency and cache hit rates. The canonical templates, the three-purposes rationale, and the cache relationship are documented on the Query Rewriting page.

Strategy selection: All non-blocked intents default to the HYBRID retrieval strategy, which runs vector search and taxonomy search sequentially. The retrieval strategy enum retains VECTOR_ONLY and GRAPH_ONLY values for future use, but all safe intents currently route to HYBRID.

After intent classification, two additional pre-retrieval steps occur:

Taxonomy enrichment (Step 5b): resolve_search_query() maps patient-friendly Dutch terms to canonical entity names (e.g., "hartfilmpje" to "ECG") and appends resolved department/examination names to the search query.
Query decomposition (Step 5c, feature-flagged): For complex multi-hop questions, an LLM gate detects queries requiring multiple independent evidence chains and decomposes them into focused sub-questions. Each sub-question is retrieved in parallel, and results are merged. See the dedicated Query Decomposition page for details.

Stage 2: Three Retrieval Channels

Corresponds to Pipeline Stage 5: Sequential Hybrid Retrieval

Three retrieval channels execute sequentially (asyncpg does not support concurrent queries on the same session), each optimized for different query types:

Vector Search (Semantic Similarity)

pgvector (pgvector docs) cosine similarity search against 1,536-dimensional text-embedding-3-large embeddings (OpenAI 2024; see ADR-0048 for the migration from BGE-M3), using HNSW indexing (Malkov & Yashunin, 2018) — the dense bi-encoder retrieval pattern follows Karpukhin et al. 2020.

Over-fetches limit × 4 candidates (default: 28 candidates for 7 final results)
Minimum similarity threshold: 0.40
Embedding prefix convention: search_query: (queries) / search_document: (documents) is only used when EMBEDDING_PROVIDER=ollama and a Nomic model is configured. The default text-embedding-3-large (OpenAI API) uses no prefixes.

Strengths: Semantic paraphrases ("pijn in de borst" matches "thoracale klachten"), conceptual similarity.

Weaknesses: Specific medical terms and proper names may score below the similarity threshold.

BM25 Keyword Search (Exact Term Matching)

BM25-style keyword search (Robertson & Zaragoza, 2009) via PostgreSQL tsvector with GIN index, using 'simple' text configuration (no stemming):

search_vector = to_tsvector('simple', title || ' ' || content || ' ' || canonical_questions)

The 'simple' config preserves exact Dutch medical terminology — stemming would incorrectly conflate terms like "behandeling" and "behandelingen" for different medical contexts.

Strengths: Exact keyword matches ("Dr. Vanderstraeten", "cardioversie"), proper names, technical terms.

Weaknesses: Cannot detect synonyms or semantic equivalents.

Canonical Questions

Each chunk stores 1-2 pre-generated Dutch questions in chunk_metadata.canonical_questions, included in the BM25 tsvector. Generated at ingestion time by the Tier 2 model.

Example: A chunk about visiting hours generates:

"Wat zijn de bezoekuren van het ZOL ziekenhuis?"
"Wanneer kan ik op bezoek komen?"

These questions are injected into the search_vector, so a user query matching the generated question text gets a BM25 boost even if the chunk content uses different phrasing.

Canonical Questions and Vector Search

Canonical questions currently only affect BM25 keyword search. They are not embedded as separate vectors. This is a known gap — full HyPE (Hypothetical Prompt Embeddings) would store questions as additional embedding vectors for even better recall.

Taxonomy Search (Entity Relationships)

PostgreSQL taxonomy search using a tiered approach:

Tier	Method	Use Case	Example
Tier 1	PostgreSQL taxonomy queries	Structured entity lookups	"Which doctors work in Cardiology?"
Tier 1b	Taxonomy alias resolution	Indirect entity matching	"hartfalen" → Hartfalen → Cardiologie via HANDLES

Taxonomy alias resolution: Before graph queries execute, resolve_search_query() maps patient-friendly Dutch terms to canonical names. Example aliases:

Patient Term	Canonical Form
huidarts	Dermatologie
suikerziekte	Diabetes Mellitus
hartfilmpje	ECG
oogarts	Oftalmologie
scan	Radiologie

This ensures that queries using everyday language still find the correct graph entities.

Stage 3: Score Fusion (RRF)

Part of Pipeline Stage 5

Results from vector and BM25 are combined using Reciprocal Rank Fusion (RRF; Cormack et al., 2009) with k=60:

rrf_score = Σ 1/(k + rank_i + 1)    for each result list i containing the document

Chunks found only by BM25 (not in vector results) receive similarity=0.0 and are ranked purely by their RRF score. This means BM25-only chunks compete on rank position rather than receiving an artificial similarity floor.

Graph results are merged with priority ordering: typed graph > vector > semantic graph. Typed node results receive fixed high similarity scores (0.90-0.95) reflecting their high precision for entity queries.

Stage 4: Metadata Boosting

Corresponds to Pipeline Stage 6: Metadata Boosting

Multiple multiplicative boost signals re-rank results based on contextual metadata, covering category relevance, recency, entity type alignment, conversation continuity, and more. These signals leverage enriched document metadata populated during ingestion.

For the complete metadata boosting algorithm with all signals, weights, conditions, and rationale, see Stage 6: Metadata Boosting in the Query Pipeline documentation.

Stage 5b: Value Framework Affinity Rerank

Always-on rerank step that runs after Stage 4 and before Stage 5. Each retrieved chunk is classified into one of six content categories (practical, appointments, general, legal_admin, clinical_info, regulatory) and its score is multiplied by the matching coefficient from the intent × content_category affinity matrix. Chunks are then re-sorted by the new score. The rerank prevents cross-category contamination (e.g., the wheelchair-vs-cardiology regression where high vector similarity to "rolstoel" surfaced cardiology content under a navigational intent). Telemetry is written to app.category_mismatch_telemetry per query for operator review. See Query Pipeline §Stage 5b for the algorithm and Reranking & Evaluation for the broader rerank pipeline.

Stage 5c: Synthetic Doctor-List Injection

Conditional rerank step that fires only when intent in DOCTOR_LOOKUP or DEPARTMENT_OR_SERVICE_LOOKUP AND the query contains a list-signal phrase AND a department hint can be resolved. When all three gates pass, the stage queries the taxonomy for all doctors associated with the resolved department and inserts a synthetic chunk listing them into the retrieved-chunks set before context assembly. This guarantees the LLM has the full roster when answering "alle X-ologen bij ZOL"-style questions; without it, retrieval might surface a partial list of individual doctor brochures. When any gate fails, the stage is a no-op (zero latency). See Taxonomy Query Enrichment §Stage 5c and Query Pipeline §Stage 5c.

Stage 5: Keyword Rescue

A last-resort fallback that catches specific terms (6+ characters) missed by both vector and BM25:

Extract specific terms from the query (excluding Dutch stop words)
Check if each term appears in any of the top-K results
For missing terms, execute a direct ILIKE '%term%' search on chunk content
Rescued chunks receive hardcoded scores (similarity=0.85, boosted_score=0.90)

Example: The query "Welke arts bij psoriasis?" — if "psoriasis" (9 chars) does not appear in any top result, the rescue search finds chunks containing that exact term.

Performance Note

Keyword rescue uses unindexed ILIKE which is O(N) over all chunks. At the current corpus size this is fast, but it should be revisited if the corpus grows significantly.

Stage 6: Context Assembly

Corresponds to Pipeline Stage 6b: Context Assembly

See the dedicated Context Assembly page for details. In brief:

Expand: Fetch ±1 adjacent chunks per retrieved chunk
Deduplicate: Strip ~70-token chunking overlaps
Group by document: Merge chunks from same document into coherent blocks
Token budget: Cap at 8,000 tokens, dropping lowest-relevance blocks first

Stage 7: Context Building with Page Summaries

The assembled chunks are formatted into the final context string that the LLM reads.

Page Summary Injection

Each document's page summary (generated during ingestion by the LLM Entity Validator) is prepended to the first chunk from that document:

[1] Uit cardiologie_raadpleging.html (pagina 1):
[Pagina context: Deze pagina beschrijft de afdeling Cardiologie van ZOL,
inclusief de artsen, behandelingen en consultatiemogelijkheden op campus Sint-Jan.]

De raadpleging duurt gemiddeld 30 minuten. U brengt best uw identiteitskaart
en verwijsbrief mee.

Without the page summary, the LLM would not know which department this consultation information refers to. The summary resolves this ambiguity without adding an LLM call at query time (summaries are pre-computed).

Full Contextual Retrieval (ADR-0019)

The system implements Anthropic's contextual retrieval pattern at all three levels: (1) LLM-generated chunk context is prepended before embedding for enriched vector search, (2) the enriched text is used for BM25 indexing, and (3) page summaries are prepended at generation time when building the LLM context. This three-level approach reduces retrieval failure rates by up to 67% compared to naive chunking. See ADR-0019.

Graph Context Separation

When graph results are present, they appear after a separator:

--- AANVULLENDE ZOL INFORMATIE ---
Dr. Van den Berg is verbonden aan de afdeling Cardiologie op campus Sint-Jan.
Consultatie: maandag en woensdag.

The RAG system prompt includes special instructions for handling graph context, telling the LLM to integrate this structured data with the document context.

Stage 8: LLM Generation

Corresponds to Pipeline Stage 7: Response Generation

The final messages sent to the LLM:

System prompt (Dutch): Strict grounding rules, citation format [1], no medical advice
Conversation history: Last 5 exchanges for multi-turn context
User message: Contextdocumenten:\n\{context\}\n\nVraag: \{question\}

Model: Tier 2 (gpt-4.1 direct), fallback chain to local Ollama model. Tier 3 (gpt-5.2) is reserved for escalated search only.

Escalated Search (Think Harder)

When users signal dissatisfaction, the Think Harder flow provides enhanced retrieval:

Aspect	Normal Pipeline	Escalated
Candidates	20 (full mode)	100
Min similarity	0.40	0.35
Reranker	Jina Reranker v2 (BGE-reranker-v2-m3 fallback)	Jina Reranker v2 (BGE-reranker-v2-m3 fallback)
After reranking	Top 15 (full mode)	Top 20 (configurable via `rag_escalation_rerank_top_k`)
LLM model	Tier 2 (gpt-4.1)	Escalation model (gpt-5.2)
Max tokens	1,000 / 1,500 (full mode)	3,000

The cross-encoder reranker jointly encodes (query, document) pairs, providing more accurate relevance scoring than the bi-encoder similarity used in normal retrieval.

How Each Enhancement Contributes

Enhancement	Stage	Impact	Query Types Helped
Canonical questions	2 (BM25)	Bridges vocabulary gap	"Wat zijn de bezoekuren?" → exact match
Page summaries	7 (Context)	Disambiguates chunks	"consultatie" → which department?
Taxonomy aliases	2 (Graph)	Dutch term resolution	"huidarts" → Dermatologie
Metadata boosting	4	Re-ranks by relevance signals	Follow-up queries, campus-specific queries
Content keyword boost	4	Promotes exact term presence	Specific medical terms
Keyword rescue	5	Catches missed terms	Rare procedures, specific doctor names
Context assembly	6	Coherent document sections	All content queries
Query decomposition	1b	Splits multi-hop into sub-queries	"Doctor X on campus Y doing procedure Z"
Cross-encoder reranking	Escalated	Precision improvement	Complex queries after initial failure

Data Flow: Complete Example

Query: "Welke arts behandelt epilepsie bij kinderen?"

Intent: doctor_lookup (confidence: 0.92) → forces HYBRID
Vector: Returns 28 candidates about neurology, pediatrics, epilepsy info
BM25: Matches "epilepsie" and "kinderen" in chunk content + canonical questions
Graph: Typed node query finds doctors with condition "Epilepsie" via HANDLES relationship (Department handles Condition)
Score fusion: RRF (k=60); graph results merged first
Metadata boost: Neurology documents get +20% category match; pediatrics docs get +10% entity type
Keyword rescue: "epilepsie" (9 chars) verified present in top results — no rescue needed
Context assembly: Expand ±1 for neurology page, deduplicate, group, budget to 8,000 tokens
Context building: Neurology page summary prepended; graph shows "Dr. X, Neurologie, campus Sint-Jan"
LLM: Generates Dutch response with [1] citations, includes graph entity data

Best Practice Alignment

The pipeline aligns with 2025-2026 production RAG best practices:

Best Practice	Status	Reference
Hybrid search (vector + BM25)	Implemented	ADR-007
Knowledge graph for entities	Implemented	ADR-006
Contextual retrieval	Full (embedding + BM25 + generation-time)	ADR-0019
Canonical questions / HyPE	Partial (BM25 only, not embedded)	ADR-007
Cross-encoder reranking	Always-on (full mode)	ADR-0024
Intent-driven strategy selection	Implemented	Built-in
Multi-turn conversation context	Implemented (25% boost)	Built-in
Token budget management	Implemented (8,000 tokens)	ADR-007
Safety-first (no medical advice)	Implemented (multi-layer)	Built-in
Taxonomy-driven normalization	Implemented (960+ lines)	ADR-0014
Multi-hop query decomposition	Implemented (feature-flagged)	ADR-0032
Value Framework intent-category affinity rerank	Implemented (always-on, Stage 5b)	migration-066
Synthetic doctor-list injection	Implemented (Stage 5c, conditional)	`_qs_maybe_inject_doctor_list`

References

Foundational Research

Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. — Seminal RAG paper establishing the retrieve-then-generate paradigm.
Robertson, S. & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389. — Theoretical foundation for BM25 scoring used in Stage 2.
Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal Rank Fusion outperforms Condorcet and Individual Rank Learning Methods. SIGIR 2009. — Foundation of the RRF algorithm used in Stage 3.
Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020. — Established the dual-encoder dense retrieval paradigm underlying vector search.
Malkov, Y. A. & Yashunin, D. A. (2018). Efficient and Robust Approximate Nearest Neighbor Search Using HNSW Graphs. IEEE TPAMI, 42(4), 824–836. — HNSW index algorithm used by pgvector.

Embedding Models

Chen, J., et al. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity. arXiv:2402.03216 (@chen2024bgem3). — Note: BGE-M3 was the primary embedding model from Feb–Apr 2026 (ADR-0033); since ADR-0048 (Apr 2026) it is used only for ColBERT reranking. Primary vector search now uses text-embedding-3-large.

Retrieval Enhancement Techniques

Anthropic. (2024). Introducing Contextual Retrieval. — 49% retrieval failure reduction with contextual embeddings + hybrid search.
Wang, Z., et al. (2023). Learning to Filter Context for Retrieval-Augmented Generation (FILCO). arXiv:2311.08377. — Context filtering reducing prompt lengths by 64% while improving generation quality.
Vake, L., et al. (2025). HyPE-RAG: Hypothetical Prompt Embeddings. SSRN. — Hypothetical Prompt Embeddings for query-aligned chunk retrieval.
Nogueira, R. & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085. — Foundational cross-encoder reranking approach used in Stage 4 (escalated search).
Bruch, S., et al. (2023). An Analysis of Fusion Functions for Hybrid Retrieval. ACM TOIS. — Analysis comparing RRF with linear combination for hybrid search.

Query Decomposition

Ammann, P. J. L., et al. (2025). Question Decomposition for Retrieval-Augmented Generation. — +36.7% MRR@10 improvement via sub-question decomposition (theoretical basis for Stage 1b).
Min, S., et al. (2019). Multi-hop Reading Comprehension through Question Decomposition and Rescoring. ACL 2019. — Foundational work on question decomposition for multi-hop reasoning.

GraphRAG and Knowledge Graph Integration

Peng, B., et al. (2025). Retrieval-Augmented Generation with Graphs (GraphRAG). — Comprehensive survey of Graph-Based Indexing, Graph-Guided Retrieval, and Graph-Enhanced Generation.
Sarmah, B., et al. (2024). HybridRAG: Integrating Knowledge Graphs and Vector Retrieval. — Formalises the hybrid KG+vector approach used in Stage 2.

Industry References

Architecture for a Serious RAG System (RAGFlow)
Pinecone Rerankers Guide
ParadeDB Hybrid Search Manual
HALT-RAG Framework — Confidence-based abstention

Pipeline Overview​

Stage 1: Intent Classification + Query Rewriting​

Stage 2: Three Retrieval Channels​

Vector Search (Semantic Similarity)​

BM25 Keyword Search (Exact Term Matching)​

Canonical Questions​

Taxonomy Search (Entity Relationships)​

Stage 3: Score Fusion (RRF)​

Stage 4: Metadata Boosting​

Stage 5b: Value Framework Affinity Rerank​

Stage 5c: Synthetic Doctor-List Injection​

Stage 5: Keyword Rescue​

Stage 6: Context Assembly​

Stage 7: Context Building with Page Summaries​

Page Summary Injection​

Graph Context Separation​

Stage 8: LLM Generation​

Escalated Search (Think Harder)​

How Each Enhancement Contributes​

Data Flow: Complete Example​

Best Practice Alignment​

References​

Foundational Research​

Embedding Models​

Retrieval Enhancement Techniques​

Query Decomposition​

GraphRAG and Knowledge Graph Integration​

Industry References​