Context Retrieval Architecture
The ZOL Intelligent Search system uses a multi-signal retrieval pipeline that combines vector search (@karpukhin2020dpr), BM25 keyword matching (Robertson & Zaragoza 2009), knowledge-graph traversal, metadata boosting, the always-on Value Framework affinity rerank (Stage 5b), the conditional synthetic doctor-list injection (Stage 5c), and contextual enrichment into a unified context-retrieval mechanism. This page explains how each signal contributes to the final context the LLM reads.
See ADR-0017 for the architectural decision record.
This page numbers the retrieval-focused stages sequentially (1-8). For the end-to-end pipeline stage numbers (which include the semantic cache check and quality evaluation), see the Query Processing Pipeline. The mapping is: this page's Stage 1 = Pipeline Stage 2-3, Stage 2 = Pipeline Stage 5, Stage 3-4 = Pipeline Stage 5-6, and so on.
Pipeline Overview
Stage 1: Intent Classification + Query Rewriting
Corresponds to Pipeline Stages 2-3
Before any retrieval occurs, the system classifies the user's intent and rewrites follow-up queries:
- Intent classification (Tier 2, ~400ms): Maps the query to one of 12 intent categories. Four intents (
out_of_scope_medical_advice,off_topic,other_hospital,vague_input) are blocked immediately with a safety response. - Query rewriting (Tier 2, combined with intent classification when history exists): The LLM reformulates every query — regardless of input language — into a canonical Dutch clinical sentence, and resolves follow-up pronouns against conversation history. This maximizes both retrieval consistency and cache hit rates. The canonical templates, the three-purposes rationale, and the cache relationship are documented on the Query Rewriting page.
Strategy selection: All non-blocked intents default to the HYBRID retrieval strategy, which runs vector search and taxonomy search sequentially. The retrieval strategy enum retains VECTOR_ONLY and GRAPH_ONLY values for future use, but all safe intents currently route to HYBRID.
After intent classification, two additional pre-retrieval steps occur:
- Taxonomy enrichment (Step 5b):
resolve_search_query()maps patient-friendly Dutch terms to canonical entity names (e.g., "hartfilmpje" to "ECG") and appends resolved department/examination names to the search query. - Query decomposition (Step 5c, feature-flagged): For complex multi-hop questions, an LLM gate detects queries requiring multiple independent evidence chains and decomposes them into focused sub-questions. Each sub-question is retrieved in parallel, and results are merged. See the dedicated Query Decomposition page for details.
Stage 2: Three Retrieval Channels
Corresponds to Pipeline Stage 5: Sequential Hybrid Retrieval
Three retrieval channels execute sequentially (asyncpg does not support concurrent queries on the same session), each optimized for different query types:
Vector Search (Semantic Similarity)
pgvector (pgvector docs) cosine similarity search against 1,536-dimensional text-embedding-3-large embeddings (OpenAI 2024; see ADR-0048 for the migration from BGE-M3), using HNSW indexing (Malkov & Yashunin, 2018) — the dense bi-encoder retrieval pattern follows Karpukhin et al. 2020.
- Over-fetches
limit × 4candidates (default: 28 candidates for 7 final results) - Minimum similarity threshold: 0.40
- Embedding prefix convention:
search_query:(queries) /search_document:(documents) is only used whenEMBEDDING_PROVIDER=ollamaand a Nomic model is configured. The defaulttext-embedding-3-large(OpenAI API) uses no prefixes.
Strengths: Semantic paraphrases ("pijn in de borst" matches "thoracale klachten"), conceptual similarity.
Weaknesses: Specific medical terms and proper names may score below the similarity threshold.
BM25 Keyword Search (Exact Term Matching)
BM25-style keyword search (Robertson & Zaragoza, 2009) via PostgreSQL tsvector with GIN index, using 'simple' text configuration (no stemming):
search_vector = to_tsvector('simple', title || ' ' || content || ' ' || canonical_questions)
The 'simple' config preserves exact Dutch medical terminology — stemming would incorrectly conflate terms like "behandeling" and "behandelingen" for different medical contexts.
Strengths: Exact keyword matches ("Dr. Vanderstraeten", "cardioversie"), proper names, technical terms.
Weaknesses: Cannot detect synonyms or semantic equivalents.
Canonical Questions
Each chunk stores 1-2 pre-generated Dutch questions in chunk_metadata.canonical_questions, included in the BM25 tsvector. Generated at ingestion time by the Tier 2 model.
Example: A chunk about visiting hours generates:
- "Wat zijn de bezoekuren van het ZOL ziekenhuis?"
- "Wanneer kan ik op bezoek komen?"
These questions are injected into the search_vector, so a user query matching the generated question text gets a BM25 boost even if the chunk content uses different phrasing.
Canonical questions currently only affect BM25 keyword search. They are not embedded as separate vectors. This is a known gap — full HyPE (Hypothetical Prompt Embeddings) would store questions as additional embedding vectors for even better recall.
Taxonomy Search (Entity Relationships)
PostgreSQL taxonomy search using a tiered approach:
| Tier | Method | Use Case | Example |
|---|---|---|---|
| Tier 1 | PostgreSQL taxonomy queries | Structured entity lookups | "Which doctors work in Cardiology?" |
| Tier 1b | Taxonomy alias resolution | Indirect entity matching | "hartfalen" → Hartfalen → Cardiologie via HANDLES |
Taxonomy alias resolution: Before graph queries execute, resolve_search_query() maps patient-friendly Dutch terms to canonical names. Example aliases:
| Patient Term | Canonical Form |
|---|---|
| huidarts | Dermatologie |
| suikerziekte | Diabetes Mellitus |
| hartfilmpje | ECG |
| oogarts | Oftalmologie |
| scan | Radiologie |
This ensures that queries using everyday language still find the correct graph entities.
Stage 3: Score Fusion (RRF)
Part of Pipeline Stage 5
Results from vector and BM25 are combined using Reciprocal Rank Fusion (RRF; Cormack et al., 2009) with k=60:
rrf_score = Σ 1/(k + rank_i + 1) for each result list i containing the document
Chunks found only by BM25 (not in vector results) receive similarity=0.0 and are ranked purely by their RRF score. This means BM25-only chunks compete on rank position rather than receiving an artificial similarity floor.
Graph results are merged with priority ordering: typed graph > vector > semantic graph. Typed node results receive fixed high similarity scores (0.90-0.95) reflecting their high precision for entity queries.
Stage 4: Metadata Boosting
Corresponds to Pipeline Stage 6: Metadata Boosting
Multiple multiplicative boost signals re-rank results based on contextual metadata, covering category relevance, recency, entity type alignment, conversation continuity, and more. These signals leverage enriched document metadata populated during ingestion.
For the complete metadata boosting algorithm with all signals, weights, conditions, and rationale, see Stage 6: Metadata Boosting in the Query Pipeline documentation.
Stage 5b: Value Framework Affinity Rerank
Always-on rerank step that runs after Stage 4 and before Stage 5. Each retrieved chunk is classified into one of six content categories (practical, appointments, general, legal_admin, clinical_info, regulatory) and its score is multiplied by the matching coefficient from the intent × content_category affinity matrix. Chunks are then re-sorted by the new score. The rerank prevents cross-category contamination (e.g., the wheelchair-vs-cardiology regression where high vector similarity to "rolstoel" surfaced cardiology content under a navigational intent). Telemetry is written to app.category_mismatch_telemetry per query for operator review. See Query Pipeline §Stage 5b for the algorithm and Reranking & Evaluation for the broader rerank pipeline.
Stage 5c: Synthetic Doctor-List Injection
Conditional rerank step that fires only when intent in DOCTOR_LOOKUP or DEPARTMENT_OR_SERVICE_LOOKUP AND the query contains a list-signal phrase AND a department hint can be resolved. When all three gates pass, the stage queries the taxonomy for all doctors associated with the resolved department and inserts a synthetic chunk listing them into the retrieved-chunks set before context assembly. This guarantees the LLM has the full roster when answering "alle X-ologen bij ZOL"-style questions; without it, retrieval might surface a partial list of individual doctor brochures. When any gate fails, the stage is a no-op (zero latency). See Taxonomy Query Enrichment §Stage 5c and Query Pipeline §Stage 5c.
Stage 5: Keyword Rescue
A last-resort fallback that catches specific terms (6+ characters) missed by both vector and BM25:
- Extract specific terms from the query (excluding Dutch stop words)
- Check if each term appears in any of the top-K results
- For missing terms, execute a direct
ILIKE '%term%'search on chunk content - Rescued chunks receive hardcoded scores (
similarity=0.85, boosted_score=0.90)
Example: The query "Welke arts bij psoriasis?" — if "psoriasis" (9 chars) does not appear in any top result, the rescue search finds chunks containing that exact term.
Keyword rescue uses unindexed ILIKE which is O(N) over all chunks. At the current corpus size this is fast, but it should be revisited if the corpus grows significantly.
Stage 6: Context Assembly
Corresponds to Pipeline Stage 6b: Context Assembly
See the dedicated Context Assembly page for details. In brief:
- Expand: Fetch ±1 adjacent chunks per retrieved chunk
- Deduplicate: Strip ~70-token chunking overlaps
- Group by document: Merge chunks from same document into coherent blocks
- Token budget: Cap at 8,000 tokens, dropping lowest-relevance blocks first
Stage 7: Context Building with Page Summaries
The assembled chunks are formatted into the final context string that the LLM reads.
Page Summary Injection
Each document's page summary (generated during ingestion by the LLM Entity Validator) is prepended to the first chunk from that document:
[1] Uit cardiologie_raadpleging.html (pagina 1):
[Pagina context: Deze pagina beschrijft de afdeling Cardiologie van ZOL,
inclusief de artsen, behandelingen en consultatiemogelijkheden op campus Sint-Jan.]
De raadpleging duurt gemiddeld 30 minuten. U brengt best uw identiteitskaart
en verwijsbrief mee.
Without the page summary, the LLM would not know which department this consultation information refers to. The summary resolves this ambiguity without adding an LLM call at query time (summaries are pre-computed).
The system implements Anthropic's contextual retrieval pattern at all three levels: (1) LLM-generated chunk context is prepended before embedding for enriched vector search, (2) the enriched text is used for BM25 indexing, and (3) page summaries are prepended at generation time when building the LLM context. This three-level approach reduces retrieval failure rates by up to 67% compared to naive chunking. See ADR-0019.
Graph Context Separation
When graph results are present, they appear after a separator:
--- AANVULLENDE ZOL INFORMATIE ---
Dr. Van den Berg is verbonden aan de afdeling Cardiologie op campus Sint-Jan.
Consultatie: maandag en woensdag.
The RAG system prompt includes special instructions for handling graph context, telling the LLM to integrate this structured data with the document context.
Stage 8: LLM Generation
Corresponds to Pipeline Stage 7: Response Generation
The final messages sent to the LLM:
- System prompt (Dutch): Strict grounding rules, citation format
[1], no medical advice - Conversation history: Last 5 exchanges for multi-turn context
- User message:
Contextdocumenten:\n\{context\}\n\nVraag: \{question\}
Model: Tier 2 (gpt-4.1 direct), fallback chain to local Ollama model. Tier 3 (gpt-5.2) is reserved for escalated search only.
Escalated Search (Think Harder)
When users signal dissatisfaction, the Think Harder flow provides enhanced retrieval:
| Aspect | Normal Pipeline | Escalated |
|---|---|---|
| Candidates | 20 (full mode) | 100 |
| Min similarity | 0.40 | 0.35 |
| Reranker | Jina Reranker v2 (BGE-reranker-v2-m3 fallback) | Jina Reranker v2 (BGE-reranker-v2-m3 fallback) |
| After reranking | Top 15 (full mode) | Top 20 (configurable via rag_escalation_rerank_top_k) |
| LLM model | Tier 2 (gpt-4.1) | Escalation model (gpt-5.2) |
| Max tokens | 1,000 / 1,500 (full mode) | 3,000 |
The cross-encoder reranker jointly encodes (query, document) pairs, providing more accurate relevance scoring than the bi-encoder similarity used in normal retrieval.
How Each Enhancement Contributes
| Enhancement | Stage | Impact | Query Types Helped |
|---|---|---|---|
| Canonical questions | 2 (BM25) | Bridges vocabulary gap | "Wat zijn de bezoekuren?" → exact match |
| Page summaries | 7 (Context) | Disambiguates chunks | "consultatie" → which department? |
| Taxonomy aliases | 2 (Graph) | Dutch term resolution | "huidarts" → Dermatologie |
| Metadata boosting | 4 | Re-ranks by relevance signals | Follow-up queries, campus-specific queries |
| Content keyword boost | 4 | Promotes exact term presence | Specific medical terms |
| Keyword rescue | 5 | Catches missed terms | Rare procedures, specific doctor names |
| Context assembly | 6 | Coherent document sections | All content queries |
| Query decomposition | 1b | Splits multi-hop into sub-queries | "Doctor X on campus Y doing procedure Z" |
| Cross-encoder reranking | Escalated | Precision improvement | Complex queries after initial failure |
Data Flow: Complete Example
Query: "Welke arts behandelt epilepsie bij kinderen?"
- Intent:
doctor_lookup(confidence: 0.92) → forces HYBRID - Vector: Returns 28 candidates about neurology, pediatrics, epilepsy info
- BM25: Matches "epilepsie" and "kinderen" in chunk content + canonical questions
- Graph: Typed node query finds doctors with condition "Epilepsie" via HANDLES relationship (Department handles Condition)
- Score fusion: RRF (k=60); graph results merged first
- Metadata boost: Neurology documents get +20% category match; pediatrics docs get +10% entity type
- Keyword rescue: "epilepsie" (9 chars) verified present in top results — no rescue needed
- Context assembly: Expand ±1 for neurology page, deduplicate, group, budget to 8,000 tokens
- Context building: Neurology page summary prepended; graph shows "Dr. X, Neurologie, campus Sint-Jan"
- LLM: Generates Dutch response with
[1]citations, includes graph entity data
Best Practice Alignment
The pipeline aligns with 2025-2026 production RAG best practices:
| Best Practice | Status | Reference |
|---|---|---|
| Hybrid search (vector + BM25) | Implemented | ADR-007 |
| Knowledge graph for entities | Implemented | ADR-006 |
| Contextual retrieval | Full (embedding + BM25 + generation-time) | ADR-0019 |
| Canonical questions / HyPE | Partial (BM25 only, not embedded) | ADR-007 |
| Cross-encoder reranking | Always-on (full mode) | ADR-0024 |
| Intent-driven strategy selection | Implemented | Built-in |
| Multi-turn conversation context | Implemented (25% boost) | Built-in |
| Token budget management | Implemented (8,000 tokens) | ADR-007 |
| Safety-first (no medical advice) | Implemented (multi-layer) | Built-in |
| Taxonomy-driven normalization | Implemented (960+ lines) | ADR-0014 |
| Multi-hop query decomposition | Implemented (feature-flagged) | ADR-0032 |
| Value Framework intent-category affinity rerank | Implemented (always-on, Stage 5b) | migration-066 |
| Synthetic doctor-list injection | Implemented (Stage 5c, conditional) | _qs_maybe_inject_doctor_list |
References
Foundational Research
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. — Seminal RAG paper establishing the retrieve-then-generate paradigm.
- Robertson, S. & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389. — Theoretical foundation for BM25 scoring used in Stage 2.
- Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal Rank Fusion outperforms Condorcet and Individual Rank Learning Methods. SIGIR 2009. — Foundation of the RRF algorithm used in Stage 3.
- Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020. — Established the dual-encoder dense retrieval paradigm underlying vector search.
- Malkov, Y. A. & Yashunin, D. A. (2018). Efficient and Robust Approximate Nearest Neighbor Search Using HNSW Graphs. IEEE TPAMI, 42(4), 824–836. — HNSW index algorithm used by pgvector.
Embedding Models
- Chen, J., et al. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity. arXiv:2402.03216 (@chen2024bgem3). — Note: BGE-M3 was the primary embedding model from Feb–Apr 2026 (ADR-0033); since ADR-0048 (Apr 2026) it is used only for ColBERT reranking. Primary vector search now uses
text-embedding-3-large.
Retrieval Enhancement Techniques
- Anthropic. (2024). Introducing Contextual Retrieval. — 49% retrieval failure reduction with contextual embeddings + hybrid search.
- Wang, Z., et al. (2023). Learning to Filter Context for Retrieval-Augmented Generation (FILCO). arXiv:2311.08377. — Context filtering reducing prompt lengths by 64% while improving generation quality.
- Vake, L., et al. (2025). HyPE-RAG: Hypothetical Prompt Embeddings. SSRN. — Hypothetical Prompt Embeddings for query-aligned chunk retrieval.
- Nogueira, R. & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085. — Foundational cross-encoder reranking approach used in Stage 4 (escalated search).
- Bruch, S., et al. (2023). An Analysis of Fusion Functions for Hybrid Retrieval. ACM TOIS. — Analysis comparing RRF with linear combination for hybrid search.
Query Decomposition
- Ammann, P. J. L., et al. (2025). Question Decomposition for Retrieval-Augmented Generation. — +36.7% MRR@10 improvement via sub-question decomposition (theoretical basis for Stage 1b).
- Min, S., et al. (2019). Multi-hop Reading Comprehension through Question Decomposition and Rescoring. ACL 2019. — Foundational work on question decomposition for multi-hop reasoning.
GraphRAG and Knowledge Graph Integration
- Peng, B., et al. (2025). Retrieval-Augmented Generation with Graphs (GraphRAG). — Comprehensive survey of Graph-Based Indexing, Graph-Guided Retrieval, and Graph-Enhanced Generation.
- Sarmah, B., et al. (2024). HybridRAG: Integrating Knowledge Graphs and Vector Retrieval. — Formalises the hybrid KG+vector approach used in Stage 2.
Industry References
- Architecture for a Serious RAG System (RAGFlow)
- Pinecone Rerankers Guide
- ParadeDB Hybrid Search Manual
- HALT-RAG Framework — Confidence-based abstention