State-of-the-Art Assessment

This page evaluates the ZOL RAG retrieval pipeline against current state-of-the-art (SOTA) practices in embedding, retrieval, ranking, and knowledge-grounded routing.

Last verified

This assessment reflects the 302-question golden evaluation set v3.6 with a 99.0 % pass rate (296/299), taken from the definitive baseline run of 2026-03-21 (commit 1e22091; see thesis Chapter 4). It also reflects the current taxonomy (2,663 entities and 3,591 relationships after a @bodenreider2004umls-influenced consolidation), the 5-layer safety architecture, SNOMED CT deep integration (@snomed_international), and OpenAI text-embedding-3-large embeddings (@openai2024embeddings).

Overall Verdict

The ZOL pipeline is solidly SOTA for a production medical RAG system, and in several dimensions it goes beyond standard practice. It implements the top recommendations from Anthropic's Contextual Retrieval research, hybrid search with RRF (@cormack2009rrf), cross-encoder reranking (@nogueira2019passagererank), optional ColBERT late-interaction reranking (@khattab2020colbert), and knowledge graph authority routing (@sarmah2024hybridrag, @edge2024graphrag). All of this runs on OpenAI text-embedding-3-large for embeddings (@openai2024embeddings) and the OpenAI direct API for LLM generation. The SNOMED CT medical terminology integration (656,287 Dutch description rows, IS_A hierarchical expansion via @snomed_international) places it ahead of most production medical RAG systems in multilingual entity resolution.

Pipeline Architecture

What We Do Well

1. Contextual Embeddings (Anthropic's Contextual Retrieval)

Status: Fully implemented

Each chunk is embedded as enriched text = chunk context + canonical questions + raw text. This directly implements Anthropic's Contextual Retrieval technique, which reduces the top-20-chunk retrieval failure rate by 35% compared to naive embedding.

[Chunk context: LLM-generated summary of surrounding content]
[Canonical questions: "Welke artsen werken bij Cardiologie?"]
[Raw chunk text: "De dienst Cardiologie biedt..."]

The chunk_context and canonical_questions are generated during ingestion by the LLM and stored in chunk metadata. At embedding time, all three components are concatenated and embedded together.
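A minimal sketch of the concatenation step, assuming a newline separator and the component order given above. The function name and the separator are illustrative and are not taken from the ZOL codebase.

```python
def build_enriched_text(chunk_context, canonical_questions, raw_text):
    """Join chunk context, canonical questions, and raw text into one string to embed."""
    # Order follows the description above: context, then questions, then raw text.
    parts = [chunk_context, *canonical_questions, raw_text]
    # Drop empty components so a chunk without questions still embeds cleanly.
    return "\n".join(p.strip() for p in parts if p and p.strip())


enriched = build_enriched_text(
    "LLM-generated summary of surrounding content",
    ["Welke artsen werken bij Cardiologie?"],
    "De dienst Cardiologie biedt...",
)
```

The resulting string is what would be passed to the embedding model in place of the raw chunk text.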

Reference: Anthropic (2024). Contextual Retrieval. Reduces retrieval failure by 35% (embeddings) and 49% (embeddings + BM25).

2. Hybrid Search with Reciprocal Rank Fusion

Status: Fully implemented (ADR-0020)

The pipeline combines dense vector search (semantic) with BM25 keyword search (lexical) using Reciprocal Rank Fusion (RRF) with k=60 (Cormack et al., 2009). This score-agnostic fusion consistently outperforms weighted linear combination and is the production standard for enterprise RAG as of 2025. See Hybrid Search Strategy for the formula and detailed explanation.
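For orientation, RRF scores each document as the sum of 1 / (k + rank) over every ranked list in which it appears, with rank starting at 1. The sketch below shows that computation with k=60. The function name and the tie-breaking rule (alphabetical by document id) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank matters, so dense and BM25 scores never need
            # to be put on a common scale.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # toy ids from vector search
bm25_hits = ["b", "d", "a"]    # toy ids from keyword search
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Documents that appear in both lists accumulate two terms, which is why "a" and "b" outrank "c" and "d" in this toy input.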

3. Cross-Encoder Reranking (Always-On)

Status: Fully implemented (ADR-0024)

After retrieval and fusion, all results pass through Jina Reranker v2 (API), with BAAI/bge-reranker-v2-m3 as a local fallback. The cross-encoder jointly scores each query-document pair, which is more accurate than the bi-encoder similarity score because it can attend to fine-grained query-document interactions.

Mode	Candidates Retrieved	Reranked To	Primary Model	Fallback
Normal (`rag_full_mode=True`)	20	10	Jina Reranker v2	bge-reranker-v2-m3
Escalated ("Think Harder")	100	20	Jina Reranker v2	bge-reranker-v2-m3

Graph results are pinned (excluded from reranking and prepended after), since their relevance is determined by entity matching rather than text similarity.
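The pinning behaviour can be sketched as follows. The candidate shape (dicts with "source" and "text" keys), the score_fn callback, and the decision to apply top_n to the reranked items only are assumptions for illustration; the real reranker call is replaced by a toy scoring function.

```python
def rerank_with_pinned_graph(query, candidates, score_fn, top_n):
    """Rerank non-graph candidates and put graph results in front, untouched."""
    # Graph results skip the cross-encoder: their relevance comes from entity matching.
    pinned = [c for c in candidates if c["source"] == "graph"]
    rerankable = [c for c in candidates if c["source"] != "graph"]
    ranked = sorted(rerankable, key=lambda c: score_fn(query, c["text"]), reverse=True)
    return pinned + ranked[:top_n]


def toy_overlap_score(query, text):
    """Stand-in for a cross-encoder: counts shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


candidates = [
    {"id": "v1", "source": "vector", "text": "parking en bereikbaarheid"},
    {"id": "g1", "source": "graph", "text": "Cardiologie -> campus"},
    {"id": "v2", "source": "vector", "text": "artsen van de dienst cardiologie"},
]
result = rerank_with_pinned_graph("artsen cardiologie", candidates, toy_overlap_score, top_n=1)
```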

Reference: Cross-encoder reranking typically improves NDCG@5 by 5-15% in medical/high-stakes domains.

4. Keyword Rescue

Status: Implemented (novel addition)

When specific query terms (6+ characters, not stop words) don't appear in ANY retrieved results, a direct content search fetches up to 3 additional chunks. This handles the long tail of rare medical terms where neither embeddings nor BM25 produce relevant results.
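A rough sketch of the rescue trigger, assuming substring matching against the retrieved chunk texts. The stop-word set, the search_fn callback, and both function names are hypothetical stand-ins; only the 6-character threshold and the cap of 3 extra chunks come from the description above.

```python
import re

STOP_WORDS = {"wanneer", "waarom", "kunnen", "hebben"}  # illustrative, not the real list


def missing_query_terms(query, result_texts, min_len=6):
    """Return long, non-stop-word query terms that occur in none of the results."""
    terms = {
        t for t in re.findall(r"\w+", query.lower())
        if len(t) >= min_len and t not in STOP_WORDS
    }
    haystack = " ".join(result_texts).lower()
    return sorted(t for t in terms if t not in haystack)


def keyword_rescue(query, result_texts, search_fn, max_extra=3):
    """Fetch up to max_extra additional chunks for terms the retrieval missed."""
    extra = []
    for term in missing_query_terms(query, result_texts):
        for chunk in search_fn(term):
            if chunk not in result_texts and chunk not in extra:
                extra.append(chunk)
            if len(extra) >= max_extra:
                return extra
    return extra
```

In use, search_fn would be a direct content search (for example a text match on the chunk table) that returns chunk texts for a single term.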

5. Knowledge Graph with Authority Routing

Status: Fully implemented with graph authority boost

PostgreSQL taxonomy entities (doctors, departments, conditions, treatments, campuses) provide structured entity relationships. The system implements graph authority routing: for department routing questions (which department handles a condition), the knowledge graph is treated as the authoritative source, overriding vector search when they conflict. This is critical because vector search can surface tangential mentions of conditions in unrelated department pages, leading to incorrect routing.

The graph authority architecture:

Always-on graph injection — graph context is included for all condition, treatment, and symptom intents, even when vector search returns strong results
LLM conflict resolution — the system prompt explicitly instructs the LLM to trust the graph for department routing while using vector content for clinical details
SNOMED-enhanced condition resolution — the condition resolver merges 55+ hardcoded aliases with ~53 SNOMED-derived condition aliases, plus IS_A ancestor expansion for subtypes not directly in the taxonomy

This aligns with the GraphRAG trend and goes further by establishing a formal authority hierarchy: graph > vector for structural routing, vector > graph for content details.
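The precedence rule can be written down as a small function. This is only an illustration of the hierarchy: as described above, the pipeline resolves conflicts through the system prompt rather than in code, and the question-kind label and function name here are hypothetical.

```python
STRUCTURAL_KINDS = {"department_routing"}  # hypothetical label for routing questions


def pick_authority(question_kind, graph_answer, vector_answer):
    """Return (source, answer): graph wins for routing, vector wins for content."""
    if question_kind in STRUCTURAL_KINDS:
        preferred, fallback = ("graph", graph_answer), ("vector", vector_answer)
    else:
        preferred, fallback = ("vector", vector_answer), ("graph", graph_answer)
    # Fall back to the other source only when the preferred one has nothing.
    return preferred if preferred[1] is not None else fallback
```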

Evaluation: 38/38 condition_department questions pass (100%), including 7 graph-authority-tagged questions where only graph routing produces the correct answer.

6. Multi-Signal Metadata Boosting

Status: Implemented

Beyond similarity and reranking, the pipeline applies multiple domain-specific boost signals that leverage enriched document metadata -- covering category relevance, conversation context continuity, content keyword presence, authority dampening, and more.

For the complete list of all signals with weights, trigger conditions, and rationale, see Stage 6: Metadata Boosting in the Query Pipeline documentation.

7. Page Summary Context

Status: Implemented

Page summaries (LLM-generated during ingestion) are not embedded — they are prepended to the first chunk of each document at LLM context assembly time. This provides the LLM with document-level context without inflating the embedding vector with summarization noise.

8. SNOMED CT Medical Terminology Integration

Status: Fully implemented (always-on)

The pipeline integrates SNOMED CT Belgian Edition (356K concepts, 656,287 active Dutch description rows, 7.3M rows across 4 PostgreSQL tables — see @snomed_international and @bodenreider2004umls for the parent UMLS framework) for medical terminology expansion at query time. This goes significantly beyond standard RAG practice, where terminology expansion is typically limited to simple synonym dictionaries.

Three resolution strategies operate in cascade:

Strategy	Mechanism	Example
Synonym expansion	SNOMED synonyms via JSON cache (154 entries)	voorhuidvernauwing → Nauwe voorhuid
IS_A hierarchical expansion	Walk up ancestor tree (max_depth=3), try each ancestor through taxonomy	specific subtype → parent condition → department
FINDING_SITE routing	Body structure → department mapping (51 curated mappings)	structure of urethra → Urologie

The SNOMED integration implements the BMQExpander pattern (Bhogal et al., 2007), where controlled vocabulary synonyms are used to expand queries before retrieval, adapted to the medical domain with SNOMED CT's hierarchical structure.
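The cascade in the table can be sketched as three lookups tried in order. Everything below is a simplified assumption: the dictionaries stand in for the JSON cache and the PostgreSQL tables, each concept has a single IS_A parent (SNOMED CT allows several), and the toy entries other than the two examples from the table are invented.

```python
def resolve_department(term, synonyms, taxonomy, is_a_parent,
                       finding_site, site_to_department, max_depth=3):
    """Try synonym expansion, then IS_A ancestors, then FINDING_SITE routing."""
    # 1. Synonym expansion: map the patient's wording to the preferred term.
    concept = synonyms.get(term, term)
    if concept in taxonomy:
        return taxonomy[concept]
    # 2. IS_A expansion: walk up at most max_depth ancestors.
    current = concept
    for _ in range(max_depth):
        current = is_a_parent.get(current)
        if current is None:
            break
        if current in taxonomy:
            return taxonomy[current]
    # 3. FINDING_SITE routing: body structure -> department (None if unmapped).
    return site_to_department.get(finding_site.get(concept))


synonyms = {"voorhuidvernauwing": "Nauwe voorhuid"}
taxonomy = {"Nauwe voorhuid": "Urologie", "parent condition": "Cardiologie"}  # toy mapping
is_a_parent = {"specific subtype": "intermediate", "intermediate": "parent condition"}
finding_site = {"urethra disorder": "structure of urethra"}
site_to_department = {"structure of urethra": "Urologie"}
```

Each stage runs only when the previous one fails, so the cheapest lookup answers the common cases.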

Reference: Bhogal, J., Macfarlane, A., & Smith, P. (2007). A Review of Ontology Based Query Expansion. Information Processing & Management, 43(4), 866–886.

See SNOMED CT Terminology for the full integration architecture.

9. ColBERT Multi-Vector Reranking (Optional)

Status: Implemented, feature-flagged (default: off)

An optional third-stage reranker using BGE-M3 in ColBERT mode provides late-interaction scoring. Unlike cross-encoders, which encode the query and passage jointly and emit one relevance score, ColBERT encodes query and passage separately into per-token embeddings and scores them via MaxSim (for each query token, find the maximum similarity with any passage token, then sum). This preserves fine-grained term-level matching, which is particularly valuable for:

Dutch compound word disambiguation (hartchirurgie vs hartritmestoornissen)
Doctor name precision (Dr. Mullens vs Dr. Peeters in cardiology)
Multi-entity queries combining specialist + campus + procedure
Terminology mismatch (patient colloquial → clinical terms)

The ColBERT stage runs after the primary cross-encoder reranker and uses BGE-M3's native ColBERT capability, so no separate model is needed. BGE-M3 is used only for ColBERT reranking, not for primary embeddings. Configuration when enabled: colbert_reranking_enabled=True, colbert_top_k=10.
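The MaxSim operator itself is small enough to show directly. The sketch below uses plain Python lists and toy 2-dimensional token vectors; in practice the vectors would come from BGE-M3's multi-vector output, and the function names are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, passage_vecs):
    """Late-interaction score: for each query token take its best passage match, then sum."""
    return sum(max(dot(q, p) for p in passage_vecs) for q in query_vecs)


query_tokens = [[1.0, 0.0], [0.0, 1.0]]      # toy per-token embeddings
passage_tokens = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim_score(query_tokens, passage_tokens)
```

Because every query token contributes its own best match, a passage that misses one specific term (for example a doctor's name) loses that token's contribution even if the overall topic matches.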

Reference: Khattab, O. & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020. Expected +3-5% NDCG@10 from late interaction.

See Reranking Evaluation for empirical Dutch medical benchmarks.

Embedding Model: text-embedding-3-large

The system uses OpenAI text-embedding-3-large (3,072 dimensions, truncated to 1,536 for pgvector HNSW indexing) via the OpenAI direct API. This model was selected based on:

Model	MTEB-NL Retrieval	Dimensions	Dutch	Local	Cost
nomic-embed-text (previous)	N/A	768	Moderate	Ollama	Free
text-embedding-3-large (current)	64.6	1,536	Excellent	OpenAI API	$0.13/1M tokens
multilingual-e5-large-instruct	61.4	1024	Strong	No	Free
Cohere embed-v4	~65	1024	Strong	No	API cost
OpenAI text-3-large	~64.6	3072	Good	No	API cost

text-embedding-3-large provides stronger multilingual retrieval quality than the open-source alternatives in the table: its MTEB-NL retrieval score of 64.6 is ahead of multilingual-e5-large-instruct (61.4) and roughly level with Cohere embed-v4 (~65).

Embedding-model lineage

The system migrated through three embedding models: nomic-embed-text (768d, ADR-0005) → BGE-M3 (1,024d, ADR-0033) → text-embedding-3-large (1,536d, ADR-0048). The move to BGE-M3 was quality-driven; the move to text-embedding-3-large was driven primarily by voice-channel latency (Ollama's CPU serialization tax), while preserving retrieval quality and removing an on-prem service. The current model is the highest-MTEB-scored multilingual embedder we evaluated.

Potential Improvements

Enable ColBERT Reranking (Low Effort, Moderate Impact)

ColBERT multi-vector reranking (@khattab2020colbert) is implemented but disabled by default. Enabling it adds a third reranking stage that provides fine-grained token-level matching. Literature suggests +3-5 % NDCG@10. Needs empirical validation on the full 302-question golden set v3.6 (the same set used for all current pass-rate claims on this page) before enabling in production.

Impact: +3-5% retrieval quality | Cost: ~200ms additional latency per query

Deeper SNOMED IS_A Expansion (Low Effort, Targeted Impact)

The current IS_A expansion walks 3 levels up the ancestor tree. For highly specific subtypes (e.g., rare genetic conditions), deeper traversal or combined IS_A + FINDING_SITE routing could resolve additional edge cases. The infrastructure is in place — only the max_depth parameter and combined strategy need tuning.

Impact: ~5-10 additional conditions resolved | Cost: Minimal code change, potential latency from deeper traversal

Dutch-Specific Fine-Tuned Model (Moderate Impact)

The MTEB-NL paper introduced E5-NL models fine-tuned specifically for Dutch (see also @muennighoff2022mteb for the parent MTEB framework). These models show competitive retrieval performance with smaller model sizes. The current setup already uses the OpenAI hosted API for embeddings (ADR-0048), so the operational-simplicity argument that previously protected the on-prem path no longer applies — the question is now whether ~1-3 % MTEB-NL gain justifies a new self-hosted model dependency. Given current pass rate at 99.0 %, the marginal value is small.

Impact: ~1-3% retrieval improvement | Cost: Custom model deployment, new ops surface

Late Chunking (Experimental)

Late chunking runs the entire document through a long-context embedding model first and only then pools the token embeddings into per-chunk vectors, improving handling of anaphoric references by 10-12%. Our contextual embedding approach addresses the same problem from the other side, by prepending context to each chunk before embedding, so the two are competing solutions rather than complementary ones.

Impact: Marginal given existing contextual enrichment | Cost: Architectural change

Matryoshka / Adaptive Dimensions (Low Priority)

text-embedding-3-large supports native dimension reduction (Matryoshka embeddings) — the 3,072 dimensions are truncated to 1,536 for pgvector HNSW compatibility without significant quality loss.
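A minimal sketch of client-side truncation, assuming the vector is cut to its leading dimensions and then re-normalised to unit length so cosine and inner-product comparisons stay meaningful. The OpenAI embeddings endpoint can also return shortened vectors directly through its dimensions parameter; whether ZOL truncates client-side or server-side is not stated here.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and rescale to unit L2 norm."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    # A zero vector cannot be normalised; return it unchanged.
    return [x / norm for x in head] if norm else head


short = truncate_embedding([3.0, 4.0, 12.0], dims=2)  # toy 3-d vector cut to 2-d
```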

Impact: Latency optimization only | Cost: Model change

Evaluation Results

The system is evaluated using the 302-question golden evaluation set v3.6 across 21 categories, with per-question expected entities, expected source URLs, and safety annotations. The denominator of 299 reflects three non-deterministic / cache-test items excluded from the headline rate. Statistical analysis follows @efron1993bootstrap for confidence intervals; LLM-as-judge metrics follow @zheng2023llmjudge; the broader RAG-evaluation methodology aligns with @es2023ragas, @thakur2021beir, and the IR-evaluation foundations of @voorhees2002philosophy and @manning2008ir.
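As a rough illustration of the bootstrap approach, the sketch below computes a percentile confidence interval for the mean of per-question scores. The resample count, the fixed seed, and the choice of the plain percentile method are assumptions; the thesis may use a different bootstrap variant.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Toy input shaped like the headline pass rate: 296 passes, 3 failures.
outcomes = [1] * 296 + [0] * 3
low, high = bootstrap_ci(outcomes)
```

The same function applies to any per-question metric, such as entity recall, by passing the per-question values instead of pass/fail flags.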

Current Metrics (definitive baseline run 2026-03-21, commit `1e22091`)

Metric	Value	Source
Pass rate	99.0 % (296/299)	thesis Chapter 4, Table 4.1
Entity recall	0.932 (95% CI [0.916, 0.965])	thesis Chapter 4, Table 4.2
Faithfulness (RAGAS / LLM-judge)	0.959	DeepEval (@es2023ragas, @zheng2023llmjudge)
Safety refusal accuracy	100 % (14/14)	safety_refusal category
Adversarial GCG handling	100 % (12/12)	adversarial_gcg category — see @zou2023gcg, @liao2024amplegcg
Condition → department routing	100 % (46/46)	condition_department category, including graph-authority items
SNOMED terminology questions	100 % (33/33)	snomed_terminology category
Multi-hop graph	100 % (37/37)	multi_hop_graph category
Multilingual	100 % (16/16)	multilingual category
Median response time	7,829 ms	thesis Chapter 4, Table 4.3

Retrieval Quality Metrics

The evaluation framework tracks ranking-aware retrieval metrics:

Metric	What It Measures
NDCG@5	Whether relevant documents appear at the top of results
MRR	How quickly the first relevant document appears
Precision@5	What fraction of top-5 results are relevant
Recall@5	What fraction of relevant documents appear in top-5
Entity Recall	Whether expected entities appear in the response

These metrics use expected_source_urls from the golden question set as ground truth, with URL prefix matching and graded relevance scoring (commit ad0fa06), where a retrieved URL that is a sub-path of the expected URL receives partial credit.
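The prefix-matching rule can be sketched as below. The partial-credit value of 0.5, the trailing-slash handling, and the example URLs are assumptions for illustration; the actual grading in commit ad0fa06 may differ.

```python
def graded_relevance(retrieved_url, expected_urls, partial_credit=0.5):
    """1.0 for an exact match, partial credit for a sub-path of an expected URL."""
    retrieved = retrieved_url.rstrip("/")
    best = 0.0
    for expected in expected_urls:
        base = expected.rstrip("/")
        if retrieved == base:
            return 1.0
        # Require a path separator so ".../cardio" does not match ".../cardiologie".
        if retrieved.startswith(base + "/"):
            best = max(best, partial_credit)
    return best


def reciprocal_rank(retrieved_urls, expected_urls):
    """1 / rank of the first retrieved URL with any relevance, else 0."""
    for rank, url in enumerate(retrieved_urls, start=1):
        if graded_relevance(url, expected_urls) > 0:
            return 1.0 / rank
    return 0.0


expected = ["https://example.org/cardiologie"]  # hypothetical golden-set entry
```

Averaging reciprocal_rank over all questions gives MRR; the graded values can feed NDCG in the same way.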

Historical limitation (resolved): Early evaluation reports (pre-February 23, 2026) showed near-zero NDCG@5, MRR, Precision@5, and Recall@5 values (0.017--0.055). This was a measurement artifact: golden questions defined expected_source_urls at a coarse department-page level, while the system correctly retrieved specific sub-pages. Reports showing 0.000 were runs where expected_source_urls were not yet populated.

Graph Authority Evidence

Seven golden questions (GQ-262 to GQ-268) are tagged graph_authority and specifically test cases where only the knowledge graph provides the correct department routing, because vector search consistently returns the wrong department due to tangential content mentions. All 7 pass, providing direct evidence for the knowledge graph's value over pure vector search.

See Golden Questions for the full evaluation methodology.

Latency Optimization (ADR-0034)

Profiling (February 2026) revealed 14-30s end-to-end latency, well above the 3-8s SOTA target. The root cause breakdown:

Bottleneck	Measured Time	% of Total	Root Cause
Main LLM response	13-17s	73%	Resolved: all calls now use OpenAI direct API
Intent classification	2.5s	14%	4,694-token prompt (17 examples)
Follow-up suggestions	1-2s	8%	Blocking `await` before yielding final chunk
Retrieval + reranking	1.5-2s	8%	Already at parity with SOTA

Optimizations Applied

Direct OpenAI routing for main RAG response (all calls use OpenAI direct)
True token streaming enabled by default (perceived TTFT: ~18s -> ~2-3s)
Async follow-up suggestions decoupled from the critical path (saves 1-2s)
Slimmed intent prompt from 17 to 7 examples (~53% token reduction)
Rate-limit handling via automatic retry with exponential backoff
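The retry behaviour in the last item can be sketched as follows. The exception class, retry count, base delay, and jitter are placeholders chosen for illustration; a real client would catch the rate-limit error type raised by its API library.

```python
import random
import time


class RateLimitError(Exception):
    """Placeholder for the rate-limit error raised by an API client."""


def call_with_backoff(fn, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponentially growing waits."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Delay doubles each attempt; jitter avoids synchronised retries.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Passing sleep as a parameter keeps the helper testable without real waiting.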

Before/After Comparison

Metric	Before	After (estimated)
End-to-end latency	14-30s	6-12s
Perceived TTFT (streaming)	~18s	~2-3s
Main LLM token rate	23 tok/s	60-80 tok/s
Intent classification time	2.5s	~1.2s
Follow-up overhead on critical path	1-2s	0s

See Pipeline Latency for the full timing breakdown and configuration guide.

References

Foundational

Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. Seminal RAG paper.
Robertson, S. & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4). BM25 scoring function.
Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal Rank Fusion outperforms Condorcet and Individual Rank Learning Methods. SIGIR 2009. RRF algorithm.
Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020. Dense retrieval paradigm.
Malkov, Y. A. & Yashunin, D. A. (2018). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE TPAMI, 42(4), 824–836. HNSW index algorithm.

Embedding and Retrieval Quality

Chen, J., et al. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity. BGE-M3 is used for ColBERT reranking only, not primary embeddings.
Banar, N. & Lotfi, E. (2025). MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch. First comprehensive Dutch embedding benchmark.
Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. EMNLP 2019. Foundation for modern sentence embeddings.
Anthropic. (2024). Introducing Contextual Retrieval. 49% retrieval failure reduction.

Context Filtering and Enrichment

Wang, Z., et al. (2023). Learning to Filter Context for Retrieval-Augmented Generation (FILCO). Context filtering reducing prompt lengths by 64%.
Günther, M., et al. (2024). Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models. Alternative context-preserving approach.

Reranking

Nogueira, R. & Cho, K. (2019). Passage Re-ranking with BERT. Cross-encoder reranking paradigm.
Bruch, S., et al. (2023). An Analysis of Fusion Functions for Hybrid Retrieval. ACM TOIS. Systematic comparison of RRF vs. linear combination.

Knowledge Graph RAG

Peng, B., et al. (2025). Retrieval-Augmented Generation with Graphs (GraphRAG). Comprehensive GraphRAG survey.
Sarmah, B., et al. (2024). HybridRAG: Integrating Knowledge Graphs and Vector Retrieval. Hybrid KG+vector RAG formalisation.

Medical Terminology and ColBERT

Bhogal, J., Macfarlane, A., & Smith, P. (2007). A Review of Ontology Based Query Expansion. Information Processing & Management, 43(4), 866–886. BMQExpander pattern for controlled vocabulary expansion.
Khattab, O. & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020. Late-interaction multi-vector reranking.
Santhanam, K., et al. (2022). ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. NAACL 2022. Compressed ColBERT with residual quantization.
Bodenreider, O. (2004). The Unified Medical Language System (UMLS): Integrating Biomedical Terminology. Nucleic Acids Research, 32(Database), D267–D270. Foundation for medical terminology integration.

RAG Surveys

Gao, Y., et al. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey. Comprehensive RAG taxonomy (Naive/Advanced/Modular).
Fan, W., et al. (2024). A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. Current landscape and future directions.
