State-of-the-Art Assessment
This page evaluates the ZOL RAG retrieval pipeline against current state-of-the-art (SOTA) practices in embedding, retrieval, ranking, and knowledge-grounded routing.
This assessment reflects golden evaluation set v3.6 (302 questions), with a 99.0 % pass rate (296/299) taken from the definitive baseline run of 2026-03-21 (commit 1e22091, see thesis Chapter 4). It also reflects the current taxonomy (2,663 entities and 3,591 relationships, consolidated along lines influenced by @bodenreider2004umls), the 5-layer safety architecture, SNOMED CT deep integration (@snomed_international), and OpenAI text-embedding-3-large embeddings (@openai2024embeddings).
Overall Verdict
The ZOL pipeline is solidly SOTA for a production medical RAG system, and in several dimensions it goes beyond standard practice. It implements the top recommendations from Anthropic's Contextual Retrieval research, hybrid search with RRF (@cormack2009rrf), cross-encoder reranking (@nogueira2019passagererank), optional ColBERT late-interaction reranking (@khattab2020colbert), and knowledge graph authority routing (@sarmah2024hybridrag, @edge2024graphrag). Embeddings come from OpenAI text-embedding-3-large (@openai2024embeddings), and LLM generation uses the OpenAI direct API. The SNOMED CT medical terminology integration (656,287 Dutch description rows, IS_A hierarchical expansion via @snomed_international) places it ahead of most production medical RAG systems in multilingual entity resolution.
Pipeline Architecture
What We Do Well
1. Contextual Embeddings (Anthropic's Contextual Retrieval)
Status: Fully implemented
Each chunk is embedded as enriched text = chunk context + canonical questions + raw text. This directly implements Anthropic's Contextual Retrieval technique, which reduces the top-20-chunk retrieval failure rate by 35% compared to naive embedding.
[Chunk context: LLM-generated summary of surrounding content]
[Canonical questions: "Welke artsen werken bij Cardiologie?"]
[Raw chunk text: "De dienst Cardiologie biedt..."]
The chunk_context and canonical_questions are generated during ingestion by the LLM and stored in chunk metadata. At embedding time, all three components are concatenated and embedded together.
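A minimal sketch of that concatenation step is shown here. The `Chunk` class, the `build_enriched_text` function, and the blank-line separator are assumptions made for illustration; only the three components and the metadata names `chunk_context` and `canonical_questions` come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical chunk record; field names mirror the metadata keys named above."""
    raw_text: str
    chunk_context: str = ""
    canonical_questions: list[str] = field(default_factory=list)


def build_enriched_text(chunk: Chunk) -> str:
    """Concatenate context, canonical questions, and raw text into one string to embed."""
    parts = []
    if chunk.chunk_context:
        parts.append(chunk.chunk_context.strip())
    if chunk.canonical_questions:
        parts.append("\n".join(q.strip() for q in chunk.canonical_questions))
    parts.append(chunk.raw_text.strip())
    # Assumed separator: a blank line between the components.
    return "\n\n".join(parts)
```

The resulting string is what would be sent to the embedding model; chunks without context or questions fall back to the raw text alone.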
Reference: Anthropic (2024). Contextual Retrieval. Reduces retrieval failure by 35% (embeddings) and 49% (embeddings + BM25).
2. Hybrid Search with Reciprocal Rank Fusion
Status: Fully implemented (ADR-0020)
The pipeline combines dense vector search (semantic) with BM25 keyword search (lexical) using Reciprocal Rank Fusion (RRF) with k=60 (Cormack et al., 2009). This score-agnostic fusion consistently outperforms weighted linear combination and is the production standard for enterprise RAG as of 2025. See Hybrid Search Strategy for the formula and detailed explanation.
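As a rough illustration of why RRF is score-agnostic, the sketch below fuses ranked lists using only rank positions, with each document scored as the sum of 1 / (k + rank) over the lists it appears in. The function name and the tie-breaking rule are assumptions, not the pipeline's actual code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs using only their rank positions."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism (assumption).
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_ranking = ["a", "b", "c"]  # illustrative IDs from vector search
bm25_ranking = ["b", "d", "a"]   # illustrative IDs from BM25
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
```

Because only ranks enter the formula, the dense similarity scores and BM25 scores never need to be normalized onto a common scale.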
3. Cross-Encoder Reranking (Always-On)
Status: Fully implemented (ADR-0024)
After retrieval and fusion, all results pass through Jina Reranker v2 (API), with BAAI/bge-reranker-v2-m3 as a local fallback. The cross-encoder jointly scores each query-document pair, which is more accurate than the bi-encoder similarity score because it can attend to fine-grained query-document interactions.
| Mode | Candidates Retrieved | Reranked To | Primary Model | Fallback |
|---|---|---|---|---|
| Normal (rag_full_mode=True) | 20 | 10 | Jina Reranker v2 | bge-reranker-v2-m3 |
| Escalated ("Think Harder") | 100 | 20 | Jina Reranker v2 | bge-reranker-v2-m3 |
Graph results are pinned (excluded from reranking and prepended after), since their relevance is determined by entity matching rather than text similarity.
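The pinning behaviour can be sketched as follows. The dictionary keys `source` and `text`, the `score_fn` callable standing in for the cross-encoder, and the choice to apply `top_k` to the non-graph results only are all assumptions made for this example.

```python
def rerank_with_pinned(candidates, score_fn, top_k=10):
    """Rerank non-graph candidates by score_fn and prepend graph results untouched.

    `candidates` is a list of dicts with hypothetical keys "source" and "text".
    `score_fn(text) -> float` stands in for the cross-encoder call.
    """
    pinned = [c for c in candidates if c["source"] == "graph"]
    rest = [c for c in candidates if c["source"] != "graph"]
    # Only text-based results are scored; graph results keep their original order.
    reranked = sorted(rest, key=lambda c: score_fn(c["text"]), reverse=True)[:top_k]
    return pinned + reranked
```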
Reference: Cross-encoder reranking typically improves NDCG@5 by 5-15% in medical/high-stakes domains.
4. Keyword Rescue
Status: Implemented (novel addition)
When specific query terms (6+ characters, not stop words) don't appear in ANY retrieved results, a direct content search fetches up to 3 additional chunks. This handles the long tail of rare medical terms where neither embeddings nor BM25 produce relevant results.
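One way to express this rule in code is sketched below. The stop-word set, the tokenization, and the `search_fn` callable (a stand-in for the direct content search) are hypothetical; only the 6-character threshold and the cap of 3 extra chunks come from the description above.

```python
import re

# Illustrative stop words only; the real list is not shown in this document.
STOP_WORDS = {"waarom", "wanneer", "hoeveel"}


def unmatched_terms(query, retrieved_texts, min_len=6):
    """Return long, non-stop-word query terms absent from every retrieved text."""
    terms = {
        t for t in re.findall(r"\w+", query.lower())
        if len(t) >= min_len and t not in STOP_WORDS
    }
    haystack = " ".join(retrieved_texts).lower()
    return sorted(t for t in terms if t not in haystack)


def keyword_rescue(query, retrieved_texts, search_fn, max_extra=3):
    """Fetch up to max_extra additional chunks for terms that retrieval missed."""
    extra = []
    for term in unmatched_terms(query, retrieved_texts):
        for chunk in search_fn(term):
            if chunk not in extra:
                extra.append(chunk)
            if len(extra) >= max_extra:
                return extra
    return extra
```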
5. Knowledge Graph with Authority Routing
Status: Fully implemented with graph authority boost
PostgreSQL taxonomy entities (doctors, departments, conditions, treatments, campuses) provide structured entity relationships. The system implements graph authority routing: for department routing questions (which department handles a condition), the knowledge graph is treated as the authoritative source, overriding vector search when they conflict. This is critical because vector search can surface tangential mentions of conditions in unrelated department pages, leading to incorrect routing.
The graph authority architecture:
- Always-on graph injection — graph context is included for all condition, treatment, and symptom intents, even when vector search returns strong results
- LLM conflict resolution — the system prompt explicitly instructs the LLM to trust the graph for department routing while using vector content for clinical details
- SNOMED-enhanced condition resolution — the condition resolver merges 55+ hardcoded aliases with ~53 SNOMED-derived condition aliases, plus IS_A ancestor expansion for subtypes not directly in the taxonomy
This aligns with the GraphRAG trend and goes further by establishing a formal authority hierarchy: graph > vector for structural routing, vector > graph for content details.
Evaluation: 38/38 condition_department questions pass (100%), including 7 graph-authority-tagged questions where only graph routing produces the correct answer.
6. Multi-Signal Metadata Boosting
Status: Implemented
Beyond similarity and reranking, the pipeline applies multiple domain-specific boost signals that leverage enriched document metadata, covering category relevance, conversation context continuity, content keyword presence, authority dampening, and more.
For the complete list of all signals with weights, trigger conditions, and rationale, see Stage 6: Metadata Boosting in the Query Pipeline documentation.
7. Page Summary Context
Status: Implemented
Page summaries (LLM-generated during ingestion) are not embedded — they are prepended to the first chunk of each document at LLM context assembly time. This provides the LLM with document-level context without inflating the embedding vector with summarization noise.
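A minimal sketch of this assembly step follows. It assumes that "first chunk" means the first chunk of each document in the ordered context list, and the `[Page summary: ...]` label format is invented for the example.

```python
def assemble_context(chunks, page_summaries):
    """Build the LLM context, prepending each page summary to its document's first chunk.

    `chunks` is an ordered list of (doc_id, text) pairs.
    `page_summaries` maps doc_id -> summary text.
    """
    seen = set()
    blocks = []
    for doc_id, text in chunks:
        if doc_id not in seen and doc_id in page_summaries:
            # Hypothetical label format; the summary is never part of the embedded text.
            text = f"[Page summary: {page_summaries[doc_id]}]\n{text}"
        seen.add(doc_id)
        blocks.append(text)
    return "\n\n".join(blocks)
```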
8. SNOMED CT Medical Terminology Integration
Status: Fully implemented (always-on)
The pipeline integrates SNOMED CT Belgian Edition (356K concepts, 656,287 active Dutch description rows, 7.3M rows across 4 PostgreSQL tables — see @snomed_international and @bodenreider2004umls for the parent UMLS framework) for medical terminology expansion at query time. This goes significantly beyond standard RAG practice, where terminology expansion is typically limited to simple synonym dictionaries.
Three resolution strategies operate in cascade:
| Strategy | Mechanism | Example |
|---|---|---|
| Synonym expansion | SNOMED synonyms via JSON cache (154 entries) | voorhuidvernauwing → Nauwe voorhuid |
| IS_A hierarchical expansion | Walk up ancestor tree (max_depth=3), try each ancestor through taxonomy | specific subtype → parent condition → department |
| FINDING_SITE routing | Body structure → department mapping (51 curated mappings) | structure of urethra → Urologie |
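The cascade in the table can be sketched with plain dictionaries. Everything named here is hypothetical: the function, the lookup tables, and the simplification that each concept has a single IS_A parent (SNOMED CT concepts can have several). The sketch only shows the order in which the three strategies are tried.

```python
def resolve_department(term, synonyms, parents, finding_sites, site_to_dept,
                       taxonomy, max_depth=3):
    """Try synonym expansion, then IS_A ancestors, then FINDING_SITE routing."""
    # 1. Synonym expansion: map the term to its cached label, if any.
    label = synonyms.get(term, term)
    if label in taxonomy:
        return taxonomy[label], "direct_or_synonym"
    # 2. IS_A expansion: walk up at most max_depth ancestors.
    current = label
    for _ in range(max_depth):
        current = parents.get(current)
        if current is None:
            break
        if current in taxonomy:
            return taxonomy[current], "is_a"
    # 3. FINDING_SITE routing: map the body structure to a department.
    site = finding_sites.get(label)
    if site in site_to_dept:
        return site_to_dept[site], "finding_site"
    return None, "unresolved"
```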
The SNOMED integration implements the BMQExpander pattern (Bhogal et al., 2007), where controlled vocabulary synonyms are used to expand queries before retrieval, adapted to the medical domain with SNOMED CT's hierarchical structure.
Reference: Bhogal, J., Macfarlane, A., & Smith, P. (2007). A Review of Ontology Based Query Expansion. Information Processing & Management, 43(4), 866–886.
See SNOMED CT Terminology for the full integration architecture.
9. ColBERT Multi-Vector Reranking (Optional)
Status: Implemented, feature-flagged (default: off)
An optional third-stage reranker using BGE-M3 in ColBERT mode provides late-interaction scoring. Unlike cross-encoders (which produce a single score), ColBERT computes per-token embeddings and scores via MaxSim (for each query token, find max similarity with any passage token, then sum). This preserves fine-grained term-level matching particularly valuable for:
- Dutch compound word disambiguation (hartchirurgie vs hartritmestoornissen)
- Doctor name precision (Dr. Mullens vs Dr. Peeters in cardiology)
- Multi-entity queries combining specialist + campus + procedure
- Terminology mismatch (patient colloquial → clinical terms)
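The MaxSim scoring rule described above can be written in a few lines. This sketch uses plain Python lists and dot products for clarity; it assumes the token embeddings are already computed and normalized, and it is not the BGE-M3 implementation.

```python
def maxsim_score(query_vecs, passage_vecs):
    """Sum, over query tokens, of the best dot product with any passage token."""
    total = 0.0
    for q in query_vecs:
        total += max(
            sum(a * b for a, b in zip(q, p))  # dot product of two token vectors
            for p in passage_vecs
        )
    return total
```

Because every query token contributes its own best match, a passage that covers one distinctive token (for example a doctor's surname) gains score even when the rest of the passage is only loosely related.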
The ColBERT stage runs after the primary cross-encoder reranker and uses BGE-M3's native ColBERT capability, so no separate model is needed. BGE-M3 is used only for ColBERT reranking, not for primary embeddings. Configuration when enabled: colbert_reranking_enabled=True, colbert_top_k=10.
Reference: Khattab, O. & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020. Expected +3-5% NDCG@10 from late interaction.
See Reranking Evaluation for empirical Dutch medical benchmarks.
Embedding Model: text-embedding-3-large
The system uses OpenAI text-embedding-3-large (3,072 dimensions, truncated to 1,536 for pgvector HNSW indexing) via the OpenAI direct API. This model was selected based on:
| Model | MTEB-NL Retrieval | Dimensions | Dutch | Local | Cost |
|---|---|---|---|---|---|
| nomic-embed-text (previous) | N/A | 768 | Moderate | Ollama | Free |
| text-embedding-3-large (current) | 64.6 | 1,536 | Excellent | OpenAI API | $0.13/1M tokens |
| multilingual-e5-large-instruct | 61.4 | 1024 | Strong | No | Free |
| Cohere embed-v4 | ~65 | 1024 | Strong | No | API cost |
| text-embedding-3-large (untruncated) | ~64.6 | 3072 | Good | No | API cost |
text-embedding-3-large provides stronger multilingual retrieval quality than the open-source alternatives in the table: on MTEB-NL retrieval it scores 64.6, ahead of multilingual-e5-large-instruct (61.4) and roughly level with Cohere embed-v4 (~65).
The system has used three embedding models: nomic-embed-text (768d, ADR-0005) → BGE-M3 (1,024d, ADR-0033) → text-embedding-3-large (1,536d, ADR-0048). The first migration was quality-driven; the second was driven primarily by voice-channel latency (Ollama's CPU serialization tax) but preserved retrieval quality and removed an on-prem service. The current model is among the highest-scoring multilingual embedders we evaluated on MTEB-NL.
Potential Improvements
Enable ColBERT Reranking (Low Effort, Moderate Impact)
ColBERT multi-vector reranking (@khattab2020colbert) is implemented but disabled by default. Enabling it adds a third reranking stage that provides fine-grained token-level matching. Literature suggests +3-5 % NDCG@10. Needs empirical validation on the full 302-question golden set v3.6 (the same set used for all current pass-rate claims on this page) before enabling in production.
Impact: +3-5% retrieval quality | Cost: ~200ms additional latency per query
Deeper SNOMED IS_A Expansion (Low Effort, Targeted Impact)
The current IS_A expansion walks 3 levels up the ancestor tree. For highly specific subtypes (e.g., rare genetic conditions), deeper traversal or combined IS_A + FINDING_SITE routing could resolve additional edge cases. The infrastructure is in place — only the max_depth parameter and combined strategy need tuning.
Impact: ~5-10 additional conditions resolved | Cost: Minimal code change, potential latency from deeper traversal
Dutch-Specific Fine-Tuned Model (Moderate Impact)
The MTEB-NL paper introduced E5-NL models fine-tuned specifically for Dutch (see also @muennighoff2022mteb for the parent MTEB framework). These models show competitive retrieval performance with smaller model sizes. The current setup already uses the OpenAI hosted API for embeddings (ADR-0048), so the operational-simplicity argument that previously protected the on-prem path no longer applies — the question is now whether ~1-3 % MTEB-NL gain justifies a new self-hosted model dependency. Given current pass rate at 99.0 %, the marginal value is small.
Impact: ~1-3% retrieval improvement | Cost: Custom model deployment, new ops surface
Late Chunking (Experimental)
Late chunking embeds entire documents first and then splits the embeddings into chunks, improving handling of anaphoric references by 10-12%. Our contextual embedding approach addresses the same problem by prepending context before embedding, so the two are competing solutions and adopting both would add little.
Impact: Marginal given existing contextual enrichment | Cost: Architectural change
Matryoshka / Adaptive Dimensions (Low Priority)
text-embedding-3-large supports native dimension reduction (Matryoshka embeddings) — the 3,072 dimensions are truncated to 1,536 for pgvector HNSW compatibility without significant quality loss.
Impact: Latency optimization only | Cost: Model change
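For readers unfamiliar with Matryoshka-style truncation, the sketch below keeps the leading components of a vector and rescales the result to unit length so that cosine and dot-product comparisons remain meaningful. It is a client-side illustration under that assumption. The OpenAI embeddings API also accepts a `dimensions` parameter for these models, and this document does not state which of the two routes the pipeline takes.

```python
import math


def truncate_embedding(vector, dims=1536):
    """Keep the first `dims` components and rescale the result to unit length."""
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero vector
    return [x / norm for x in head]
```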
Evaluation Results
The system is evaluated using the 302-question golden evaluation set v3.6 across 21 categories, with per-question expected entities, expected source URLs, and safety annotations. The denominator of 299 reflects three non-deterministic / cache-test items excluded from the headline rate. Statistical analysis follows @efron1993bootstrap for confidence intervals; LLM-as-judge metrics follow @zheng2023llmjudge; the broader RAG-evaluation methodology aligns with @es2023ragas, @thakur2021beir, and the IR-evaluation foundations of @voorhees2002philosophy and @manning2008ir.
Current Metrics (definitive baseline run 2026-03-21, commit 1e22091)
| Metric | Value | Source |
|---|---|---|
| Pass rate | 99.0 % (296/299) | thesis Chapter 4, Table 4.1 |
| Entity recall | 0.932 (95% CI [0.916, 0.965]) | thesis Chapter 4, Table 4.2 |
| Faithfulness (RAGAS / LLM-judge) | 0.959 | DeepEval (@es2023ragas, @zheng2023llmjudge) |
| Safety refusal accuracy | 100 % (14/14) | safety_refusal category |
| Adversarial GCG handling | 100 % (12/12) | adversarial_gcg category — see @zou2023gcg, @liao2024amplegcg |
| Condition → department routing | 100 % (46/46) | condition_department category, including graph-authority items |
| SNOMED terminology questions | 100 % (33/33) | snomed_terminology category |
| Multi-hop graph | 100 % (37/37) | multi_hop_graph category |
| Multilingual | 100 % (16/16) | multilingual category |
| Median response time | 7,829 ms | thesis Chapter 4, Table 4.3 |
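A confidence interval such as the one reported for entity recall can be obtained with a percentile bootstrap over per-question scores. The sketch below is one common variant and is not taken from the thesis; the resample count, the seed, and the percentile method are assumptions.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-question scores."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

A percentile interval need not be symmetric around the point estimate, which is consistent with the asymmetric interval shown in the table.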
Retrieval Quality Metrics
The evaluation framework tracks ranking-aware retrieval metrics:
| Metric | What It Measures |
|---|---|
| NDCG@5 | Whether relevant documents appear at the top of results |
| MRR | How quickly the first relevant document appears |
| Precision@5 | What fraction of top-5 results are relevant |
| Recall@5 | What fraction of relevant documents appear in top-5 |
| Entity Recall | Whether expected entities appear in the response |
These metrics use expected_source_urls from the golden question set as ground truth, with URL prefix matching and graded relevance scoring (commit ad0fa06), where a retrieved URL that is a sub-path of the expected URL receives partial credit.
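The URL prefix matching and two of the ranking metrics can be sketched as follows. The partial-credit value of 0.5, the function names, and the rule that any non-zero grade counts as relevant for MRR and Precision@k are assumptions; the actual grading in commit ad0fa06 may differ.

```python
def graded_relevance(retrieved_url, expected_urls, partial=0.5):
    """1.0 for an exact match, `partial` for a sub-path of an expected URL, else 0.0."""
    best = 0.0
    for expected in expected_urls:
        base = expected.rstrip("/")
        if retrieved_url.rstrip("/") == base:
            return 1.0
        # Require a path separator so "/cardiologie-x" is not a sub-path of "/cardiologie".
        if retrieved_url.startswith(base + "/"):
            best = max(best, partial)
    return best


def mrr(retrieved, expected_urls):
    """Reciprocal rank of the first retrieved URL with non-zero relevance."""
    for rank, url in enumerate(retrieved, start=1):
        if graded_relevance(url, expected_urls) > 0:
            return 1.0 / rank
    return 0.0


def precision_at_k(retrieved, expected_urls, k=5):
    """Fraction of the top-k slots filled by URLs with non-zero relevance."""
    hits = sum(graded_relevance(u, expected_urls) > 0 for u in retrieved[:k])
    return hits / k
```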
Historical limitation (resolved): Early evaluation reports (pre-February 23, 2026) showed near-zero NDCG@5, MRR, Precision@5, and Recall@5 values (0.017-0.055). This was a measurement artifact: golden questions defined expected_source_urls at a coarse department-page level, while the system correctly retrieved specific sub-pages. Reports showing 0.000 were runs where expected_source_urls were not yet populated.
Graph Authority Evidence
Seven golden questions (GQ-262 to GQ-268) are tagged graph_authority and specifically test cases where only the knowledge graph provides the correct department routing — vector search consistently returns the wrong department due to tangential content mentions. All 7 pass at 100%, providing direct evidence for the knowledge graph's value proposition over pure vector search.
See Golden Questions for the full evaluation methodology.
Latency Optimization (ADR-0034)
Profiling (February 2026) revealed 14-30s end-to-end latency, well above the 3-8s SOTA target. The root cause breakdown:
| Bottleneck | Measured Time | % of Total | Root Cause |
|---|---|---|---|
| Main LLM response | 13-17s | 73% | Resolved: all calls now use OpenAI direct API |
| Intent classification | 2.5s | 14% | 4,694-token prompt (17 examples) |
| Follow-up suggestions | 1-2s | 8% | Blocking await before yielding final chunk |
| Retrieval + reranking | 1.5-2s | 8% | Already at parity with SOTA |
Optimizations Applied
- Direct OpenAI routing for the main RAG response (all calls now use the OpenAI direct API)
- True token streaming enabled by default (perceived TTFT: ~18s -> ~2-3s)
- Async follow-up suggestions decoupled from the critical path (saves 1-2s)
- Slimmed intent prompt from 17 to 7 examples (~53% token reduction)
- Rate-limit handling: automatic retry with exponential backoff
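The retry behaviour in the last item can be sketched as below. `RateLimitError` is a stand-in for whatever exception the API client raises, and the retry count, base delay, and jitter factor are assumptions chosen for illustration.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for the provider's rate-limit exception (hypothetical name)."""


def call_with_backoff(fn, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the error to the caller
            # Delay doubles each attempt; up to 10% jitter spreads out simultaneous retries.
            delay = base_delay * (2 ** attempt) * (1 + random.random() * 0.1)
            sleep(delay)
```

Passing `sleep` as a parameter keeps the function testable without real waiting.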
Before/After Comparison
| Metric | Before | After (estimated) |
|---|---|---|
| End-to-end latency | 14-30s | 6-12s |
| Perceived TTFT (streaming) | ~18s | ~2-3s |
| Main LLM token rate | 23 tok/s | 60-80 tok/s |
| Intent classification time | 2.5s | ~1.2s |
| Follow-up overhead on critical path | 1-2s | 0s |
See Pipeline Latency for the full timing breakdown and configuration guide.
References
Foundational
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. Seminal RAG paper.
- Robertson, S. & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4). BM25 scoring function.
- Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal Rank Fusion outperforms Condorcet and Individual Rank Learning Methods. SIGIR 2009. RRF algorithm.
- Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020. Dense retrieval paradigm.
- Malkov, Y. A. & Yashunin, D. A. (2018). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE TPAMI, 42(4), 824–836. HNSW index algorithm.
Embedding and Retrieval Quality
- Chen, J., et al. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity. BGE-M3 is used for ColBERT reranking only, not primary embeddings.
- Banar, N. & Lotfi, E. (2025). MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch. First comprehensive Dutch embedding benchmark.
- Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. EMNLP 2019. Foundation for modern sentence embeddings.
- Anthropic. (2024). Introducing Contextual Retrieval. 49% retrieval failure reduction.
Context Filtering and Enrichment
- Wang, Z., et al. (2023). Learning to Filter Context for Retrieval-Augmented Generation (FILCO). Context filtering reducing prompt lengths by 64%.
- Günther, M., et al. (2024). Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models. Alternative context-preserving approach.
Reranking
- Nogueira, R. & Cho, K. (2019). Passage Re-ranking with BERT. Cross-encoder reranking paradigm.
- Bruch, S., et al. (2023). An Analysis of Fusion Functions for Hybrid Retrieval. ACM TOIS. Systematic comparison of RRF vs. linear combination.
Knowledge Graph RAG
- Peng, B., et al. (2025). Retrieval-Augmented Generation with Graphs (GraphRAG). Comprehensive GraphRAG survey.
- Sarmah, B., et al. (2024). HybridRAG: Integrating Knowledge Graphs and Vector Retrieval. Hybrid KG+vector RAG formalisation.
Medical Terminology and ColBERT
- Bhogal, J., Macfarlane, A., & Smith, P. (2007). A Review of Ontology Based Query Expansion. Information Processing & Management, 43(4), 866–886. BMQExpander pattern for controlled vocabulary expansion.
- Khattab, O. & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020. Late-interaction multi-vector reranking.
- Santhanam, K., et al. (2022). ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. NAACL 2022. Compressed ColBERT with residual quantization.
- Bodenreider, O. (2004). The Unified Medical Language System (UMLS): Integrating Biomedical Terminology. Nucleic Acids Research, 32(Database), D267–D270. Foundation for medical terminology integration.
RAG Surveys
- Gao, Y., et al. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey. Comprehensive RAG taxonomy (Naive/Advanced/Modular).
- Fan, W., et al. (2024). A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. Current landscape and future directions.