
State-of-the-Art Assessment

This page evaluates the ZOL RAG retrieval pipeline against current state-of-the-art (SOTA) practices in embedding, retrieval, ranking, and knowledge-grounded routing.

Last verified

This assessment reflects the 302-question golden evaluation set v3.6 and its 99.0 % pass rate (296/299), taken from the definitive baseline run of 2026-03-21 (commit 1e22091, see thesis Chapter 4). It also reflects the current taxonomy (2,663 entities and 3,591 relationships, following a consolidation influenced by @bodenreider2004umls), the 5-layer safety architecture, the SNOMED CT deep integration (@snomed_international), and OpenAI text-embedding-3-large embeddings (@openai2024embeddings).

Overall Verdict

The ZOL pipeline matches current SOTA practice for a production medical RAG system, and in several dimensions it goes beyond standard practice. It implements the top recommendations from Anthropic's Contextual Retrieval research, hybrid search with RRF (@cormack2009rrf), cross-encoder reranking (@nogueira2019passagererank), optional ColBERT late-interaction reranking (@khattab2020colbert), and knowledge graph authority routing (@sarmah2024hybridrag, @edge2024graphrag). Embeddings come from OpenAI text-embedding-3-large (@openai2024embeddings), and LLM generation uses the OpenAI direct API. The SNOMED CT medical terminology integration (656,287 Dutch description rows, IS_A hierarchical expansion via @snomed_international) places it ahead of most production medical RAG systems in multilingual entity resolution.

Pipeline Architecture

What We Do Well

1. Contextual Embeddings (Anthropic's Contextual Retrieval)

Status: Fully implemented

Each chunk is embedded as enriched text = chunk context + canonical questions + raw text. This directly implements Anthropic's Contextual Retrieval technique, which reduces the top-20-chunk retrieval failure rate by 35% compared to naive embedding.

[Chunk context: LLM-generated summary of surrounding content]
[Canonical questions: "Welke artsen werken bij Cardiologie?"]
[Raw chunk text: "De dienst Cardiologie biedt..."]

The chunk_context and canonical_questions are generated during ingestion by the LLM and stored in chunk metadata. At embedding time, all three components are concatenated and embedded together.
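As a minimal sketch of the concatenation step described above: the `Chunk` dataclass, the `build_enriched_text` helper, and the blank-line separator are assumptions made for illustration, not the pipeline's actual code. Only the field names `chunk_context` and `canonical_questions` come from this page.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical container; only the metadata field names come from the docs."""
    text: str
    chunk_context: str = ""
    canonical_questions: list[str] = field(default_factory=list)


def build_enriched_text(chunk: Chunk) -> str:
    """Concatenate context + canonical questions + raw text for embedding.

    Empty components are skipped so that a chunk without metadata
    falls back to its raw text.
    """
    parts: list[str] = []
    if chunk.chunk_context:
        parts.append(chunk.chunk_context.strip())
    if chunk.canonical_questions:
        parts.append("\n".join(q.strip() for q in chunk.canonical_questions))
    parts.append(chunk.text.strip())
    # Assumed separator: a blank line between the three components.
    return "\n\n".join(parts)
```

The resulting string is what would be sent to the embedding model, while the raw chunk text stays available separately for display and citation.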

Reference: Anthropic (2024). Contextual Retrieval. Reduces retrieval failure by 35% (embeddings) and 49% (embeddings + BM25).

2. Hybrid Search with Reciprocal Rank Fusion

Status: Fully implemented (ADR-0020)

The pipeline combines dense vector search (semantic) with BM25 keyword search (lexical) using Reciprocal Rank Fusion (RRF) with k=60 (Cormack et al., 2009). This score-agnostic fusion consistently outperforms weighted linear combination and is the production standard for enterprise RAG as of 2025. See Hybrid Search Strategy for the formula and detailed explanation.
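The fusion step can be sketched as follows. This is an illustrative implementation of standard RRF with k=60, where each document's fused score is the sum of 1/(k + rank) over the rankings that contain it. The function name and the use of plain document IDs are assumptions; the production code may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Only rank positions are used, so the dense and BM25 scores never
    need to be normalised against each other.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_ranking = ["a", "b", "c"]   # hypothetical vector-search order
bm25_ranking = ["b", "d", "a"]    # hypothetical BM25 order
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
```

In this example, document `b` ranks first after fusion because it sits near the top of both lists, even though it is not first in the dense ranking.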

3. Cross-Encoder Reranking (Always-On)

Status: Fully implemented (ADR-0024)

After retrieval and fusion, all results pass through Jina Reranker v2 (API), with BAAI/bge-reranker-v2-m3 as a local fallback. The cross-encoder jointly scores each query-document pair, which is more accurate than the bi-encoder similarity score because it can attend to fine-grained query-document interactions.

| Mode | Candidates Retrieved | Reranked To | Primary Model | Fallback |
|---|---|---|---|---|
| Normal (rag_full_mode=True) | 20 | top-10 | Jina Reranker v2 | bge-reranker-v2-m3 |
| Escalated ("Think Harder") | 100 | top-20 | Jina Reranker v2 | bge-reranker-v2-m3 |

Graph results are pinned: they are excluded from reranking and prepended to the reranked list afterwards, since their relevance is determined by entity matching rather than text similarity.
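A minimal sketch of the pinning behaviour, assuming candidates are dictionaries with a hypothetical `source` field and that `score_fn` stands in for the cross-encoder call. Whether pinned results count toward the top-k budget is not stated on this page; here they do not.

```python
from typing import Callable


def rerank_with_pinned(
    candidates: list[dict],
    score_fn: Callable[[dict], float],
    top_k: int,
) -> list[dict]:
    """Rerank text candidates while keeping graph results pinned in front."""
    # Hypothetical convention: graph results carry source == "graph".
    pinned = [c for c in candidates if c.get("source") == "graph"]
    rerankable = [c for c in candidates if c.get("source") != "graph"]
    # score_fn is a placeholder for the cross-encoder relevance score.
    reranked = sorted(rerankable, key=score_fn, reverse=True)[:top_k]
    # Assumption: pinned results do not consume the top_k budget.
    return pinned + reranked
```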

Reference: Cross-encoder reranking typically improves NDCG@5 by 5-15% in medical/high-stakes domains.

4. Keyword Rescue

Status: Implemented (novel addition)

When specific query terms (6+ characters, not stop words) don't appear in ANY retrieved results, a direct content search fetches up to 3 additional chunks. This handles the long tail of rare medical terms where neither embeddings nor BM25 produce relevant results.
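The rescue trigger can be sketched roughly as below. The stop-word list, the tokenisation, the substring check, and the `search_fn` callback are illustrative assumptions; only the 6-character threshold and the limit of 3 extra chunks come from the description above.

```python
import re
from typing import Callable

# Illustrative stop words only; the real list is not shown on this page.
STOP_WORDS = frozenset({"waarom", "wanneer", "hoeveel", "kunnen", "moeten"})


def missing_query_terms(query: str, retrieved_texts: list[str], min_length: int = 6) -> list[str]:
    """Return long, non-stop-word query terms absent from every retrieved text."""
    haystack = " ".join(retrieved_texts).lower()
    missing: list[str] = []
    for term in re.findall(r"\w+", query.lower()):
        if len(term) < min_length or term in STOP_WORDS or term in missing:
            continue
        if term not in haystack:  # simple substring check (assumption)
            missing.append(term)
    return missing


def keyword_rescue(
    query: str,
    retrieved_texts: list[str],
    search_fn: Callable[[str], list[str]],
    max_extra: int = 3,
) -> list[str]:
    """Fetch up to max_extra additional chunks for terms that nothing matched."""
    extra: list[str] = []
    for term in missing_query_terms(query, retrieved_texts):
        for chunk in search_fn(term):  # search_fn: hypothetical direct content search
            if chunk in extra or chunk in retrieved_texts:
                continue
            extra.append(chunk)
            if len(extra) >= max_extra:
                return extra
    return extra
```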

5. Knowledge Graph with Authority Routing

Status: Fully implemented with graph authority boost

PostgreSQL taxonomy entities (doctors, departments, conditions, treatments, campuses) provide structured entity relationships. The system implements graph authority routing: for department routing questions (which department handles a condition), the knowledge graph is treated as the authoritative source, overriding vector search when they conflict. This is critical because vector search can surface tangential mentions of conditions in unrelated department pages, leading to incorrect routing.

The graph authority architecture:

  1. Always-on graph injection — graph context is included for all condition, treatment, and symptom intents, even when vector search returns strong results
  2. LLM conflict resolution — the system prompt explicitly instructs the LLM to trust the graph for department routing while using vector content for clinical details
  3. SNOMED-enhanced condition resolution — the condition resolver merges 55+ hardcoded aliases with ~53 SNOMED-derived condition aliases, plus IS_A ancestor expansion for subtypes not directly in the taxonomy

This aligns with the GraphRAG trend and goes further by establishing a formal authority hierarchy: graph > vector for structural routing, vector > graph for content details.

Evaluation: 38/38 condition_department questions pass (100%), including 7 graph-authority-tagged questions where only graph routing produces the correct answer.

6. Multi-Signal Metadata Boosting

Status: Implemented

Beyond similarity and reranking, the pipeline applies multiple domain-specific boost signals that leverage enriched document metadata -- covering category relevance, conversation context continuity, content keyword presence, authority dampening, and more.

For the complete list of all signals with weights, trigger conditions, and rationale, see Stage 6: Metadata Boosting in the Query Pipeline documentation.

7. Page Summary Context

Status: Implemented

Page summaries (LLM-generated during ingestion) are not embedded — they are prepended to the first chunk of each document at LLM context assembly time. This provides the LLM with document-level context without inflating the embedding vector with summarization noise.
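A rough sketch of this assembly step, under the assumption that each chunk is a dictionary with hypothetical `doc_id` and `text` keys and that summaries are looked up by document ID. The bracketed label format is also an assumption.

```python
def assemble_context(chunks: list[dict], page_summaries: dict[str, str]) -> str:
    """Build the LLM context, prepending each page summary exactly once.

    The summary is attached to the first chunk seen for a document and
    is never part of the text that was embedded.
    """
    seen_docs: set[str] = set()
    blocks: list[str] = []
    for chunk in chunks:
        doc_id, text = chunk["doc_id"], chunk["text"]
        if doc_id not in seen_docs and doc_id in page_summaries:
            # Assumed label format, for illustration only.
            text = f"[Page summary: {page_summaries[doc_id]}]\n{text}"
        seen_docs.add(doc_id)
        blocks.append(text)
    return "\n\n".join(blocks)
```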

8. SNOMED CT Medical Terminology Integration

Status: Fully implemented (always-on)

The pipeline integrates SNOMED CT Belgian Edition (356K concepts, 656,287 active Dutch description rows, 7.3M rows across 4 PostgreSQL tables — see @snomed_international and @bodenreider2004umls for the parent UMLS framework) for medical terminology expansion at query time. This goes significantly beyond standard RAG practice, where terminology expansion is typically limited to simple synonym dictionaries.

Three resolution strategies operate in cascade:

| Strategy | Mechanism | Example |
|---|---|---|
| Synonym expansion | SNOMED synonyms via JSON cache (154 entries) | voorhuidvernauwing → Nauwe voorhuid |
| IS_A hierarchical expansion | Walk up ancestor tree (max_depth=3), try each ancestor through taxonomy | specific subtype → parent condition → department |
| FINDING_SITE routing | Body structure → department mapping (51 curated mappings) | structure of urethra → Urologie |
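The IS_A strategy in the table can be illustrated with a small breadth-first walk. The in-memory `parents` and `taxonomy` dictionaries are stand-ins for the PostgreSQL tables, and the concept names in the usage note are hypothetical; only the depth limit of 3 comes from the text.

```python
from collections import deque
from typing import Optional


def resolve_via_ancestors(
    concept: str,
    parents: dict[str, list[str]],
    taxonomy: dict[str, str],
    max_depth: int = 3,
) -> Optional[str]:
    """Walk up IS_A parents (breadth-first) until a taxonomy entry is found.

    Returns the department of the nearest ancestor within max_depth
    levels, or None if no ancestor in range is known to the taxonomy.
    """
    if concept in taxonomy:
        return taxonomy[concept]
    queue = deque([(concept, 0)])
    visited = {concept}
    while queue:
        current, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not look beyond the configured depth
        for parent in parents.get(current, []):
            if parent in visited:
                continue
            visited.add(parent)
            if parent in taxonomy:
                return taxonomy[parent]
            queue.append((parent, depth + 1))
    return None
```

For example, with `parents = {"subtype": ["parent condition"]}` and `taxonomy = {"parent condition": "Urologie"}`, resolving `"subtype"` returns `"Urologie"` after one step up the hierarchy.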

The SNOMED integration implements the BMQExpander pattern (Bhogal et al., 2007), where controlled vocabulary synonyms are used to expand queries before retrieval, adapted to the medical domain with SNOMED CT's hierarchical structure.

Reference: Bhogal, J., Macfarlane, A., & Smith, P. (2007). A Review of Ontology Based Query Expansion. Information Processing & Management, 43(4), 866–886.

See SNOMED CT Terminology for the full integration architecture.

9. ColBERT Multi-Vector Reranking (Optional)

Status: Implemented, feature-flagged (default: off)

An optional third-stage reranker using BGE-M3 in ColBERT mode provides late-interaction scoring. Unlike cross-encoders (which produce a single score), ColBERT computes per-token embeddings and scores via MaxSim (for each query token, find max similarity with any passage token, then sum). This preserves fine-grained term-level matching particularly valuable for:

  • Dutch compound word disambiguation (hartchirurgie vs hartritmestoornissen)
  • Doctor name precision (Dr. Mullens vs Dr. Peeters in cardiology)
  • Multi-entity queries combining specialist + campus + procedure
  • Terminology mismatch (patient colloquial → clinical terms)
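The MaxSim scoring described above can be written out in a few lines. This is a pure-Python sketch over toy vectors, assuming cosine similarity on L2-normalised token embeddings; a real ColBERT stage would operate on BGE-M3 token embeddings in batched tensor form.

```python
import math


def _normalize(vec: list[float]) -> list[float]:
    """L2-normalise a vector; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_vecs: list[list[float]], passage_vecs: list[list[float]]) -> float:
    """Late-interaction score: sum over query tokens of the max similarity
    with any passage token. passage_vecs must be non-empty."""
    q = [_normalize(v) for v in query_vecs]
    p = [_normalize(v) for v in passage_vecs]
    total = 0.0
    for q_vec in q:
        total += max(sum(a * b for a, b in zip(q_vec, p_vec)) for p_vec in p)
    return total
```

Because each query token contributes its own best match, a passage that covers every query term scores higher than one that matches a single term very strongly.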

The ColBERT stage runs after the primary cross-encoder reranker and uses BGE-M3's native ColBERT capability, so no separate model is needed. BGE-M3 is used only for ColBERT reranking, not for primary embeddings. To enable the stage, set colbert_reranking_enabled=True, colbert_top_k=10.

Reference: Khattab, O. & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020. Expected +3-5% NDCG@10 from late interaction.

See Reranking Evaluation for empirical Dutch medical benchmarks.

Embedding Model: text-embedding-3-large

The system uses OpenAI text-embedding-3-large (3,072 dimensions, truncated to 1,536 for pgvector HNSW indexing) via the OpenAI direct API. This model was selected based on:

| Model | MTEB-NL Retrieval | Dimensions | Dutch | Local | Cost |
|---|---|---|---|---|---|
| nomic-embed-text (previous) | N/A | 768 | Moderate | Ollama | Free |
| text-embedding-3-large (current) | 64.6 | 1,536 | Excellent | OpenAI API | $0.13/1M tokens |
| multilingual-e5-large-instruct | 61.4 | 1024 | Strong | No | Free |
| Cohere embed-v4 | ~65 | 1024 | Strong | No | API cost |
| OpenAI text-3-large | ~64.6 | 3072 | Good | No | API cost |

The system uses text-embedding-3-large from OpenAI, which provides stronger multilingual retrieval quality than the open-source alternatives in the table above. Its MTEB-NL retrieval score of 64.6 is ahead of multilingual-e5-large-instruct (61.4) and close to Cohere embed-v4 (~65).

Embedding-model lineage

The system has used three embedding models, which means two migrations: nomic-embed-text (768d, ADR-0005) → BGE-M3 (1,024d, ADR-0033) → text-embedding-3-large (1,536d, ADR-0048). The first migration was quality-driven; the second was driven primarily by voice-channel latency (Ollama's CPU serialization tax) but preserved retrieval quality and removed an on-prem service. The current model is the highest-MTEB-scored multilingual embedder we evaluated.

Potential Improvements

Enable ColBERT Reranking (Low Effort, Moderate Impact)

ColBERT multi-vector reranking (@khattab2020colbert) is implemented but disabled by default. Enabling it adds a third reranking stage that provides fine-grained token-level matching. Literature suggests +3-5 % NDCG@10. Needs empirical validation on the full 302-question golden set v3.6 (the same set used for all current pass-rate claims on this page) before enabling in production.

Impact: +3-5% retrieval quality | Cost: ~200ms additional latency per query

Deeper SNOMED IS_A Expansion (Low Effort, Targeted Impact)

The current IS_A expansion walks 3 levels up the ancestor tree. For highly specific subtypes (e.g., rare genetic conditions), deeper traversal or combined IS_A + FINDING_SITE routing could resolve additional edge cases. The infrastructure is in place — only the max_depth parameter and combined strategy need tuning.

Impact: ~5-10 additional conditions resolved | Cost: Minimal code change, potential latency from deeper traversal

Dutch-Specific Fine-Tuned Model (Moderate Impact)

The MTEB-NL paper introduced E5-NL models fine-tuned specifically for Dutch (see also @muennighoff2022mteb for the parent MTEB framework). These models show competitive retrieval performance with smaller model sizes. The current setup already uses the OpenAI hosted API for embeddings (ADR-0048), so the operational-simplicity argument that previously protected the on-prem path no longer applies — the question is now whether ~1-3 % MTEB-NL gain justifies a new self-hosted model dependency. Given current pass rate at 99.0 %, the marginal value is small.

Impact: ~1-3% retrieval improvement | Cost: Custom model deployment, new ops surface

Late Chunking (Experimental)

Late chunking embeds entire documents first, then splits embeddings, improving handling of anaphoric references by 10-12%. However, our contextual embedding approach already addresses the same problem (prepending context before embedding). These are competing solutions to the same problem.

Impact: Marginal given existing contextual enrichment | Cost: Architectural change

Matryoshka / Adaptive Dimensions (Low Priority)

text-embedding-3-large supports native dimension reduction (Matryoshka embeddings) — the 3,072 dimensions are truncated to 1,536 for pgvector HNSW compatibility without significant quality loss.
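A minimal sketch of client-side truncation, assuming the full 3,072-dimensional vector is available as a list of floats. Re-normalising after truncation keeps cosine and inner-product similarity well behaved; whether the pipeline truncates client-side like this or requests 1,536 dimensions directly through the embeddings API's `dimensions` parameter is not stated on this page.

```python
import math


def truncate_embedding(vec: list[float], dims: int = 1536) -> list[float]:
    """Keep the first `dims` components and L2-normalise the result."""
    if dims <= 0 or dims > len(vec):
        raise ValueError(f"dims must be in 1..{len(vec)}, got {dims}")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # degenerate zero vector: nothing to normalise
    return [x / norm for x in head]
```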

Impact: Latency optimization only | Cost: Model change

Evaluation Results

The system is evaluated using the 302-question golden evaluation set v3.6 across 21 categories, with per-question expected entities, expected source URLs, and safety annotations. The denominator of 299 reflects three non-deterministic / cache-test items excluded from the headline rate. Statistical analysis follows @efron1993bootstrap for confidence intervals; LLM-as-judge metrics follow @zheng2023llmjudge; the broader RAG-evaluation methodology aligns with @es2023ragas, @thakur2021beir, and the IR-evaluation foundations of @voorhees2002philosophy and @manning2008ir.
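For readers unfamiliar with bootstrap confidence intervals, a percentile bootstrap over per-question scores can be sketched as below. The resample count, the fixed seed, and the choice of the percentile method are assumptions; the thesis may use a different bootstrap variant.

```python
import random


def bootstrap_ci(
    values: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Applied to per-question entity-recall scores, this yields an interval of the same kind as the 95% CI reported in the metrics table.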

Current Metrics (definitive baseline run 2026-03-21, commit 1e22091)

| Metric | Value | Source |
|---|---|---|
| Pass rate | 99.0 % (296/299) | thesis Chapter 4, Table 4.1 |
| Entity recall | 0.932 (95% CI [0.916, 0.965]) | thesis Chapter 4, Table 4.2 |
| Faithfulness (RAGAS / LLM-judge) | 0.959 | DeepEval (@es2023ragas, @zheng2023llmjudge) |
| Safety refusal accuracy | 100 % (14/14) | safety_refusal category |
| Adversarial GCG handling | 100 % (12/12) | adversarial_gcg category; see @zou2023gcg, @liao2024amplegcg |
| Condition → department routing | 100 % (46/46) | condition_department category, including graph-authority items |
| SNOMED terminology questions | 100 % (33/33) | snomed_terminology category |
| Multi-hop graph | 100 % (37/37) | multi_hop_graph category |
| Multilingual | 100 % (16/16) | multilingual category |
| Median response time | 7,829 ms | thesis Chapter 4, Table 4.3 |

Retrieval Quality Metrics

The evaluation framework tracks ranking-aware retrieval metrics:

| Metric | What It Measures |
|---|---|
| NDCG@5 | Whether relevant documents appear at the top of results |
| MRR | How quickly the first relevant document appears |
| Precision@5 | What fraction of top-5 results are relevant |
| Recall@5 | What fraction of relevant documents appear in top-5 |
| Entity Recall | Whether expected entities appear in the response |

These metrics use expected_source_urls from the golden question set as ground truth, with URL prefix matching and graded relevance scoring (commit ad0fa06), where a retrieved URL that is a sub-path of the expected URL receives partial credit.
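The graded scoring can be illustrated with a small sketch. The partial-credit value of 0.5, the rule that each expected URL is credited at most once, and the function names are assumptions for illustration; the actual weights in commit ad0fa06 are not shown on this page.

```python
import math


def match_gain(url: str, expected_url: str, partial: float = 0.5) -> float:
    """1.0 for an exact match, `partial` for a sub-path of the expected URL."""
    r, e = url.rstrip("/"), expected_url.rstrip("/")
    if r == e:
        return 1.0
    if r.startswith(e + "/"):
        return partial  # assumed partial-credit value
    return 0.0


def gains_at_k(retrieved: list[str], expected: list[str], k: int = 5) -> list[float]:
    """Graded gain per rank; each expected URL is credited at most once."""
    remaining = list(expected)
    gains: list[float] = []
    for url in retrieved[:k]:
        scored = [(match_gain(url, e), e) for e in remaining]
        gain, hit = max(scored, default=(0.0, None), key=lambda s: s[0])
        if gain > 0:
            remaining.remove(hit)
        gains.append(gain)
    return gains


def ndcg_at_k(retrieved: list[str], expected: list[str], k: int = 5) -> float:
    """NDCG@k against an ideal ranking of exact matches."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains_at_k(retrieved, expected, k)))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(expected))))
    return dcg / idcg if idcg else 0.0


def mrr(retrieved: list[str], expected: list[str]) -> float:
    """Reciprocal rank of the first retrieved URL with any credit."""
    for rank, url in enumerate(retrieved, start=1):
        if any(match_gain(url, e) > 0 for e in expected):
            return 1.0 / rank
    return 0.0
```

With prefix matching, a retrieved sub-page such as `/cardiologie/artsen` earns credit against an expected `/cardiologie` URL instead of scoring zero, which is the effect the historical note below describes.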

Historical limitation (resolved): Early evaluation reports (pre-February 23, 2026) showed near-zero NDCG@5, MRR, Precision@5, and Recall@5 values (0.017--0.055). This was a measurement artifact: golden questions defined expected_source_urls at a coarse department-page level, while the system correctly retrieved specific sub-pages. Reports showing 0.000 were runs where expected_source_urls were not yet populated.

Graph Authority Evidence

Seven golden questions (GQ-262 to GQ-268) are tagged graph_authority and specifically test cases where only the knowledge graph provides the correct department routing — vector search consistently returns the wrong department due to tangential content mentions. All 7 pass at 100%, providing direct evidence for the knowledge graph's value proposition over pure vector search.

See Golden Questions for the full evaluation methodology.

Latency Optimization (ADR-0034)

Profiling (February 2026) revealed 14-30s end-to-end latency, well above the 3-8s SOTA target. The root cause breakdown:

| Bottleneck | Measured Time | % of Total | Root Cause |
|---|---|---|---|
| Main LLM response | 13-17s | 73% | Resolved: all calls now use OpenAI direct API |
| Intent classification | 2.5s | 14% | 4,694-token prompt (17 examples) |
| Follow-up suggestions | 1-2s | 8% | Blocking await before yielding final chunk |
| Retrieval + reranking | 1.5-2s | 8% | Already at parity with SOTA |

Optimizations Applied

  1. Direct OpenAI routing for main RAG response (all calls use OpenAI direct)
  2. True token streaming enabled by default (perceived TTFT: ~18s -> ~2-3s)
  3. Async follow-up suggestions decoupled from the critical path (saves 1-2s)
  4. Slimmed intent prompt from 17 to 7 examples (~53% token reduction)
  5. Automatic retry with exponential backoff on rate-limit errors
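The retry behaviour in the last item can be sketched as follows. The `RateLimitError` class, the delay parameters, and the jitter amount are placeholders chosen for illustration; a real client would catch the rate-limit exception raised by its API library.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Placeholder for the rate-limit exception of the API client in use."""


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter avoids synchronised retries across workers.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test without waiting in real time.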

Before/After Comparison

| Metric | Before | After (estimated) |
|---|---|---|
| End-to-end latency | 14-30s | 6-12s |
| Perceived TTFT (streaming) | ~18s | ~2-3s |
| Main LLM token rate | 23 tok/s | 60-80 tok/s |
| Intent classification time | 2.5s | ~1.2s |
| Follow-up overhead on critical path | 1-2s | 0s |

See Pipeline Latency for the full timing breakdown and configuration guide.

References

Foundational

Embedding and Retrieval Quality

Context Filtering and Enrichment

Reranking

Knowledge Graph RAG

Medical Terminology and ColBERT

RAG Surveys