Academic Critical Assessment
The Section 3 ("Embedding Strategy") evaluation below was written when production used BGE-M3 (Chen et al., 2024) at 1024 dim via on-prem Ollama. As of ADR-0048 (2026-04-30) the system uses OpenAI text-embedding-3-large at 1536 dim, hosted. The discussion of multilingual coverage, dimensionality trade-offs, and multi-vector retrieval still applies — most of the BGE-M3-specific gaps (no ColBERT mode, no learned-sparse, no domain fine-tuning) hold for text-embedding-3-large as well, since it is a single dense embedder. The paragraphs are preserved verbatim because they remain the academic critique that motivated subsequent improvement work; only the model-name labels have moved on.
This chapter provides an honest and critical evaluation of the ZOL Intelligent Search system architecture, measured against the current state of the art in Retrieval-Augmented Generation (RAG), knowledge graph integration, adversarial robustness, and medical AI safety. The assessment identifies both demonstrated strengths and architectural gaps, concluding with a concrete roadmap for achieving best-in-class status.
This assessment evaluates architectural decisions and implementation quality against published academic benchmarks and production RAG surveys (Gao et al., 2024; Fan et al., 2024; Peng et al., 2025). Where the ZOL system has not been evaluated on standardised benchmarks (e.g., BEIR, MTEB-NL retrieval), this is explicitly noted as a gap. Self-reported metrics from internal golden evaluations are referenced but acknowledged as non-comparable to external benchmarks.
1. Overall Architecture Classification
Gao et al. (2024) classify RAG systems into three generations:
| Generation | Description | Key Features |
|---|---|---|
| Naive RAG | Basic retrieve-then-generate | Single retrieval, no post-processing |
| Advanced RAG | Pre-retrieval and post-retrieval optimisation | Query rewriting, reranking, metadata boosting |
| Modular RAG | Composable, interchangeable components | Pluggable retrievers, adaptive routing, agent-based orchestration |
Assessment: The ZOL system is a mature Advanced RAG with significant Modular RAG characteristics. It implements pre-retrieval optimisation (intent classification, taxonomy enrichment, query decomposition), parallel multi-channel retrieval (vector + BM25 + knowledge graph), post-retrieval refinement (RRF fusion, metadata boosting, cross-encoder reranking), context enrichment (contextual embeddings, page summaries), and — since Wave 4 — adaptive retrieval strategy selection (W4-1) and a Corrective RAG (CRAG) quality gate (W4-2, Yan et al., 2024) that classifies retrieval confidence and triggers refinement for ambiguous results. The modular elements include configurable model routing (5-tier LLM hierarchy), pluggable reranker backends (Jina API / local BGE), feature-flagged query decomposition, intent-driven strategy routing, and a ternary pre-generation quality gate with automatic retrieval refinement.
The system now implements partial adaptive orchestration: retrieval strategies are selected based on intent classification (W4-1), and retrieval results are evaluated post-retrieval via CRAG (W4-2) — if classified as AMBIGUOUS, the system automatically retries with relaxed parameters (lower similarity threshold, expanded result set, no category filter). However, the system cannot yet dynamically select between retrievers mid-execution or perform multi-hop retrieval chains, which are hallmarks of full agentic RAG (Singh et al., 2025; Trivedi et al., 2023).
Verdict: ★★★★☆ — Strong Advanced RAG with meaningful Modular RAG features (adaptive strategy + CRAG). Approaching but not yet fully agentic.
2. Retrieval Architecture
2.1 Hybrid Search (Vector + BM25)
Strengths: The combination of dense vector search (BGE-M3, 1024-dim) with sparse BM25 keyword search, fused via Reciprocal Rank Fusion (Cormack et al., 2009), represents the current production standard for enterprise RAG (Bruch et al., 2023). The ZOL system further enhances this with:
- Contextual embeddings (Anthropic, 2024) prepended at ingestion time — reduces retrieval failure by 49%
- Canonical question generation for BM25 enrichment — partially implements the HyPE pattern (Vake et al., 2025)
- Keyword rescue as a safety net for rare terms missed by both channels
Gaps:
-
No learned sparse retrieval: The system uses PostgreSQL
tsvectorwith'simple'tokenisation (no stemming). Modern sparse retrieval models like SPLADE (Formal et al., 2022) learn term importance weights that outperform raw BM25 by 5-15% on BEIR benchmarks. The'simple'configuration, while appropriate for preserving Dutch medical terms, sacrifices the morphological normalisation that would help with Dutch inflections (e.g., "behandeling" vs. "behandelingen"). -
No ColBERT/late interaction retrieval: BGE-M3 supports ColBERT retrieval mode (multi-vector matching per token), but this capability is not utilised. ColBERT provides a middle ground between bi-encoder speed and cross-encoder accuracy, and is particularly effective for long medical queries where individual term-level matching matters (Khattab & Zaharia, 2020).
-
No query-adaptive retrievalPartially addressed (W4-1): The system now implements intent-driven adaptive retrieval strategy selection — navigational queries usevector_only, entity-specific queries usegraph_first, and complex medical queries use fullhybrid. This implements the selective channel activation recommended by Adaptive RAG (Jeong et al., 2024). However, the strategy is fixed at pipeline start based on intent classification and cannot be revised mid-execution based on intermediate retrieval quality. -
Canonical questions are BM25-only: The generated canonical questions enrich BM25 but are not embedded as separate vectors. Full HyPE implementation (Vake et al., 2025) would embed hypothetical questions alongside document chunks, providing additional vector-space retrieval paths.
2.2 Knowledge Graph Integration
Strengths: The PostgreSQL taxonomy with typed entities (doctors, departments, conditions, treatments, campuses, examinations) and curated relationships (HANDLES, OFFERS, WORKS_IN, LOCATED_AT) provides structured entity traversal that vector search cannot replicate. The frozen taxonomy approach (ADR-0028) with LLM-validated hub pages ensures high taxonomy data quality — a critical requirement identified by Peng et al. (2025) in their GraphRAG survey.
The separation of graph seeding from document ingestion is architecturally sound: it prevents the "noisy graph" problem documented in early GraphRAG implementations (Edge et al., 2024), where unrestricted entity extraction from all documents produces low-quality relationships that degrade retrieval.
Gaps:
-
No graph-based reasoning: The current graph integration is purely lookup-based (Cypher queries for entity relationships). True GraphRAG (Peng et al., 2025) involves graph-guided retrieval where the graph structure informs the retrieval strategy — e.g., traversing relationship chains to discover relevant documents that wouldn't be found by similarity search. The ZOL system's graph results are simply merged with vector/BM25 results rather than guiding the retrieval process.
-
No graph embeddings: The knowledge graph nodes have no learned embeddings. Knowledge graph embedding methods (TransE, ComplEx, RotatE) could enable similarity search over the graph structure, finding related entities even when exact relationship paths don't exist in the taxonomy.
-
Static taxonomy: The frozen taxonomy (
zol_taxonomy.py) is manually maintained. While this ensures quality, it cannot scale to new conditions, treatments, or organisational changes without developer intervention. An automated taxonomy update pipeline with LLM-based validation (as proposed in ADR-0028) would improve maintainability. -
Partial ontology alignment: The system integrates SNOMED CT Belgian Edition (356K concepts, 656K descriptions, 4.7M transitive closure relationships) via ADR-0016. Query-time synonym expansion, FINDING_SITE-based department routing, and graph enrichment with SNOMED concept IDs and IS_A relationships are implemented (15/15 SNOMED golden questions pass). Remaining gaps: IS_A hierarchical traversal is not used at query time for broad-category queries, and cross-language descriptions (French, English) are imported but not loaded into the lookup tables.
2.3 Reranking
Strengths: Always-on cross-encoder reranking via Jina Reranker v2 (with local bge-reranker-v2-m3 fallback) implements the two-stage retrieval paradigm established by Nogueira and Cho (2019). The candidate reduction from 50 to 20 (ADR-0034) was validated through A/B testing showing equivalent MRR and NDCG@5.
Gaps:
-
No listwise reranking: Current reranking is pointwise (each query-document pair scored independently). Listwise reranking approaches (Pradeep et al., 2023) that consider all candidates simultaneously produce more calibrated rankings but require LLM-based rerankers.
-
No domain-adapted reranker: Neither Jina nor bge-reranker-v2-m3 is fine-tuned for Dutch medical content. Domain adaptation of rerankers has been shown to improve NDCG@5 by 3-8% in specialised domains (Thakur et al., 2021).
Verdict: ★★★★☆ — Strong production-grade retrieval. The hybrid search + reranking architecture is state-of-the-art for production systems. Adaptive retrieval (W4-1) partially closes the strategy gap. Key remaining gaps are ColBERT utilisation and graph-based reasoning.
3. Embedding Strategy
Strengths: BGE-M3 (Chen et al., 2024) is the strongest open-source multilingual embedding model available on Ollama, with a measured MTEB-NL retrieval score of 60.0. Local inference via Ollama ensures zero API cost and full data sovereignty — critical requirements for healthcare deployments. The contextual embedding approach (prepending LLM-generated context before embedding) is directly aligned with Anthropic's (2024) research showing 35-49% retrieval failure reduction.
Gaps:
-
Not benchmarked on ZOL-specific retrieval: The MTEB-NL score of 60.0 was measured on general Dutch retrieval tasks. No ZOL-domain-specific retrieval benchmark exists, so the actual quality for Dutch medical queries is inferred but not measured. Creating a domain-specific evaluation set (analogous to the golden questions but focused specifically on retrieval ranking rather than end-to-end answer quality) would provide this measurement.
-
No fine-tuning: BGE-M3 is used as-is without fine-tuning on Dutch medical text. Domain-specific fine-tuning using contrastive learning on the ZOL corpus could improve retrieval quality by 3-5% based on analogous domain adaptation results (Thakur et al., 2021). However, the cost-benefit trade-off is unclear given the existing contextual embedding enrichment.
-
No embedding compression: At 1024 dimensions, BGE-M3 embeddings consume 4KB per vector. Matryoshka representation learning (Kusupati et al., 2022) enables adaptive dimension reduction without retraining, but BGE-M3 does not support this natively. Quantisation (e.g., IVFPQ in pgvector) could reduce storage and improve query speed.
-
No multi-vector retrieval: BGE-M3 supports dense, sparse, and ColBERT retrieval modes simultaneously. Only the dense mode is used. Activating sparse and ColBERT modes would create a multi-vector retrieval system with potentially significant recall improvements.
Verdict: ★★★☆☆ — Good model selection within constraints. Local inference is a strong operational decision. However, the lack of domain-specific benchmarking and underutilisation of BGE-M3's multi-vector capabilities leave measurable improvement opportunities.
4. Context Assembly and Generation
4.1 Context Filtering
Strengths: The three-level contextual retrieval implementation (embedding-time context, BM25-time enrichment, generation-time page summaries) is a comprehensive approach to the context quality problem. The ±1 chunk expansion with overlap deduplication preserves document coherence while managing token budget.
Gaps:
-
No query-time context filtering (FILCO)Implemented (W2-1), feature-flagged: A FILCO-style sentence-level context filtering service is now implemented (context_filter_enabled, default: off) and wired into the pipeline at Step 6c. When enabled, it scores individual sentences within retrieved chunks for query relevance and removes low-scoring passages before generation. This partially addresses Wang et al.'s (2023) finding that filtering reduces prompt lengths by up to 64% while improving answer quality. The implementation uses lexical overlap scoring rather than the full conditional cross-mutual information approach from the original paper. -
Fixed token budget: The 8,000-token context budget is static. Adaptive token allocation based on query complexity (simple questions need less context, multi-hop questions need more) could improve both efficiency and quality.
-
No context compression: Long-context compression techniques (e.g., LLMLingua by Jiang et al., 2023) can reduce context length by 2-5x while preserving answer quality, enabling more documents to fit within the budget.
4.2 Generation
Strengths: The 5-tier LLM routing (nano/mini/standard/escalation/flagship) efficiently allocates model capacity to query complexity. Streaming responses with progress indicators address user experience requirements (Nielsen, 1993). The strict grounding prompt enforcing citation with [1] notation implements basic attribution.
Gaps:
-
No attribution verificationImplemented (W1-3): AnAttributionVerificationServicenow provides post-hoc citation checking using NLI-based entailment scoring. The service verifies whether each citation actually supports the corresponding claim, following Gao et al. (2023). This is available as an evaluation tool and can be integrated into the generation pipeline for runtime verification. -
No abstention mechanismImplemented (W2-2 + W4-2): ARetrievalConfidenceScorercomputes a weighted confidence score (50% top_score + 30% mean_top_k + 20% score_gap) with a configurable abstention threshold. This was further extended by the CRAG quality gate (W4-2, ADR-0038), which classifies retrieval as CORRECT/AMBIGUOUS/INCORRECT and automatically refuses generation when confidence is below threshold — implementing confidence-calibrated abstention (Ren et al., 2023) independent of LLM judgement. -
No Self-RAG or CRAGCRAG implemented (W4-2): Corrective RAG (Yan et al., 2024) is now implemented via theCRAGDecisionternary classifier. Retrieval is classified as CORRECT (generate), AMBIGUOUS (refine with relaxed parameters then re-assess), or INCORRECT (abstain). The AMBIGUOUS path triggers automatic retrieval refinement with lower similarity threshold, expanded result set, and removed category filters — adding ~0.5-1s latency only for borderline queries. Feature-flagged viacrag_enabled(default: off). Self-RAG is not yet implemented.
Verdict: ★★★★☆ — Significantly improved since initial assessment. FILCO context filtering (W2-1), attribution verification (W1-3), confidence-calibrated abstention (W2-2), and CRAG (W4-2) close the most significant generation-quality gaps. Remaining gaps: Self-RAG, adaptive token budget, and context compression.
5. Safety and Adversarial Robustness
Strengths: The 12-layer defence-in-depth architecture (ADR-0036) is significantly more comprehensive than most production RAG systems. Key highlights:
- Perplexity-based anomaly detector (H1) catches GCG-style adversarial suffixes in under 5ms using statistical heuristics — a novel, cost-effective approach that doesn't require an LLM call
- LLM-as-judge safety validation (H2) enabled by default with intent-based skip optimisation
- In-memory rate limiter fallback (H3) with burst protection prevents fail-open scenarios
- Streaming retraction with server-side enforcement (H4) and WebSocket close code 4001
The multi-layer approach aligns with the defence-in-depth principle recommended by Zou et al. (2023) for protecting against universal adversarial attacks.
Gaps:
-
No red-teaming evaluationImplemented (W3-1): A systematic red-teaming harness with 40 adversarial test cases covering GCG-style suffixes, prompt injection, context manipulation, and role-play attacks is now available (tests/evaluation/red_teaming.py). The harness tests the full safety pipeline against established attack patterns (Perez et al., 2022). -
No input/output guardrails modelImplemented (W3-2): AGuardrailsServiceintegrating Llama Guard 3 (via OpenRouter) provides trained classifier-based input/output safety validation. Feature-flagged viaguardrails_enabled(default: off). This supplements the existing regex + statistical heuristic layers with a dedicated safety classification model, addressing the gap for detecting sophisticated paraphrased attacks. -
No formal safety evaluation frameworkImplemented (W1-2 + W3-3): A safety evaluation framework (tests/evaluation/safety_evaluation.py) measures false positive rates (safe queries incorrectly blocked) and false negative rates (unsafe responses not caught) across the full safety pipeline. Additionally, an anomaly threshold validation tool (W3-3) performs ROC curve analysis to optimise detector thresholds against labelled adversarial and benign corpora, quantifying safety trade-offs as recommended by Patel et al. (2025). -
Perplexity detector false positives: The statistical anomaly detector may flag legitimate queries containing code-switched medical terminology, URLs, or non-Latin scripts. The thresholds are now validated via ROC analysis (W3-3) against a labelled dataset of adversarial and benign queries, but production-scale validation with real user traffic has not yet been conducted.
Verdict: ★★★★★ — The safety architecture now includes systematic red-teaming (W3-1), a guardrails model (W3-2), quantified FP/FN measurement (W1-2), and threshold validation via ROC analysis (W3-3). This represents a comprehensive defence-in-depth posture that exceeds published production RAG safety architectures. The remaining gap is production-scale validation with real user traffic.
6. Evaluation Methodology
Strengths: The golden evaluation framework (302 questions across 21 intent categories, v3.6) provides a reproducible, deterministic evaluation of end-to-end system quality. The primary metrics -- entity recall, pass rate, citation accuracy, safety refusal rate -- cover the key quality dimensions. The evaluation distinguishes between retrieval quality and generation quality, enabling targeted debugging.
Note on NDCG@5 / MRR: The golden evaluation reports include NDCG@5 and MRR as retrieval metrics, but these values are near-zero (typically 0.000-0.055) due to a URL granularity mismatch:
expected_source_urlsare defined at a coarse department-page level (e.g./cardiologie), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures. Without fine-grained per-document relevance judgments, these metrics cannot be meaningfully computed. The system's retrieval quality is better reflected by entity recall (0.94+) and pass rate (98.9%), which measure end-to-end answer quality.
Gaps:
-
No external benchmark evaluationImplemented (W1-4): An MTEB-NL/BEIR-NL benchmark harness (tests/evaluation/mteb_nl_benchmark.py) evaluates the BGE-M3 embedding model on standardised Dutch retrieval tasks. The measured MTEB-NL retrieval score of 60.0 provides an external reference point for the embedding model choice. A domain-specific retrieval benchmark (W2-3) with 200+ queries across ZOL-specific categories was also created. -
No inter-annotator agreement: The golden questions were created by a single annotator (the developer). Medical information retrieval evaluation requires multi-annotator agreement scores to validate ground truth quality (Tsatsaronis et al., 2015). Without inter-annotator agreement, the evaluation may reflect a single person's expectations rather than true information need. This remains an open gap.
-
No statistical significance testingImplemented (W1-1): Bootstrap confidence intervals are now computed for all evaluation metrics (tests/evaluation/statistical_analysis.py). Given the 146-question evaluation set, 95% confidence intervals quantify the reliability of observed improvements via bootstrap resampling (10,000 iterations). This addresses the point-estimate-only reporting gap. -
No user-based evaluation: All evaluation is offline (golden questions). No user study has been conducted to validate that improved retrieval metrics correlate with improved user satisfaction and task completion. In medical search contexts, user-based evaluation is particularly important because patients may have different information needs than the system designer assumes. This remains an open gap.
-
Limited LLM-as-judge validation: The system uses DeepEval's FaithfulnessMetric and AnswerRelevancyMetric for quality analytics, but the LLM judge itself has not been validated against human judgements for Dutch medical content. Zheng et al. (2023) showed that LLM judges have systematic biases that vary by language and domain. This remains an open gap.
Verdict: ★★★☆☆ — Significantly improved from the initial assessment. External benchmarks (W1-4), domain-specific retrieval benchmarks (W2-3), and bootstrap confidence intervals (W1-1) bring the evaluation closer to academic standards. The remaining critical gaps are inter-annotator agreement and user-based evaluation.
7. Incremental Crawling and Data Freshness
Strengths: The content-hash-based change detection for incremental updates implements a well-established approach from the web crawling literature (Cho & Garcia-Molina, 2003). The sitemap-driven discovery ensures comprehensive URL coverage. Content deduplication by title prevents duplicate documents from different URL paths.
Gaps:
-
No change frequency estimation: Cho and Garcia-Molina (2003) demonstrated that Poisson-based change frequency estimators improve crawl freshness by 35%. The ZOL system treats all URLs equally during re-crawls rather than prioritising frequently-changing content (e.g., doctor schedules, visiting hours).
-
No differential update: When a document changes, the entire document is re-processed (re-chunked, re-embedded). A differential update approach that identifies changed sections and updates only affected chunks would reduce re-embedding costs.
-
No freshness monitoring: There is no automated monitoring of content freshness — no alerts when crawled content becomes stale, no automatic re-crawl scheduling, no freshness metrics in the analytics dashboard.
Verdict: ★★★☆☆ — Functional incremental ingestion. Missing optimisation opportunities for change-frequency-based scheduling and differential updates.
8. Comparative Analysis: ZOL vs. State-of-the-Art
The following table compares the ZOL system against key techniques identified in recent RAG surveys (Gao et al., 2024; Fan et al., 2024; Peng et al., 2025):
| Technique | ZOL Status | SOTA Benchmark | Gap |
|---|---|---|---|
| Hybrid search (vector + BM25) | ✅ Implemented | Standard practice | None |
| Cross-encoder reranking | ✅ Always-on (Jina + BGE fallback) | Standard practice | No domain adaptation |
| Contextual embeddings | ✅ Full (embed + BM25 + gen-time) | Anthropic (2024): -49% failure | None — fully aligned |
| Knowledge graph integration | ✅ Implemented (Neo4j typed nodes) | Peng et al. (2025): GraphRAG | Lookup-only, no graph reasoning |
| RRF score fusion | ✅ k=60 | Cormack et al. (2009) | None |
| Query decomposition | ✅ Feature-flagged | Ammann et al. (2025): +36.7% MRR | None |
| Metadata boosting | ✅ 9 signals | Novel (domain-specific) | No published comparison |
| Adversarial hardening | ✅ 12 layers + anomaly detector | Zou et al. (2023): GCG defence | |
| Context filtering (FILCO) | ✅ Implemented (W2-1), feature-flagged | Wang et al. (2023): -64% prompt | Lexical overlap only (no CMI) |
| CRAG (Corrective RAG) | ✅ Implemented (W4-2), feature-flagged | Yan et al. (2024) | None — ternary gate with refinement |
| Adaptive retrieval | ✅ Implemented (W4-1) | Jeong et al. (2024) | Intent-driven only (not mid-pipeline) |
| Attribution verification | ✅ Implemented (W1-3) | Gao et al. (2023) | Available as evaluation tool |
| Retrieval confidence / abstention | ✅ Implemented (W2-2) | Ren et al. (2023) | None — calibrated abstention |
| Guardrails model | ✅ Implemented (W3-2), feature-flagged | Llama Guard 3 | None |
| External benchmark (MTEB-NL/BEIR-NL) | ✅ Evaluated (W1-4) | Standard practice | BGE-M3 score: 60.0 |
| Bootstrap confidence intervals | ✅ Implemented (W1-1) | Standard practice | None |
| Safety FP/FN measurement | ✅ Implemented (W1-2 + W3-3) | Patel et al. (2025) | ROC threshold validation |
| Domain-specific retrieval benchmark | ✅ Implemented (W2-3) | BEIR methodology | 200+ queries |
| Learned sparse retrieval (SPLADE) | ❌ Not implemented | +5-15% on BEIR | Moderate |
| ColBERT/late interaction | ❌ Not implemented | Khattab & Zaharia (2020) | Moderate |
| Self-RAG | ❌ Not implemented | Asai et al. (2024) | Moderate (latency cost) |
| User study | ❌ Not conducted | Standard for medical AI | Critical |
| Domain-adapted embedding | ❌ Not implemented | Thakur et al. (2021) | Moderate |
| Agentic RAG | ❌ Not implemented | Singh et al. (2025) | Future direction |
| Inter-annotator agreement | ❌ Not conducted | Tsatsaronis et al. (2015) | Significant |
Summary: 18/26 SOTA techniques implemented (up from 10/18). 1 significant gap (inter-annotator agreement), 3 moderate gaps (SPLADE, ColBERT, Self-RAG), 1 critical gap (user study), 1 future direction (agentic RAG).
9. Why Generic Medical QA Benchmarks Don't Apply
A common critique of domain-specific RAG systems is the absence of evaluation against established medical QA benchmarks such as MedQA (Jin et al., 2021), PubMedQA (Jin et al., 2019), MedMCQA (Pal et al., 2022), or MMLU-Medical (Hendrycks et al., 2021). While these benchmarks are appropriate for clinical decision support systems and medical knowledge models, they are fundamentally misaligned with the ZOL system for several reasons:
9.1 Scope Mismatch: Hospital-Specific vs. General Medical Knowledge
The ZOL system operates on a closed corpus of approximately 1,000 hospital-specific documents (department pages, brochures, doctor profiles, patient guides). It does not contain — and is explicitly designed not to answer — general medical knowledge questions. Approximately 80% of questions in MedQA or PubMedQA would be entirely out-of-scope because they concern diagnoses, drug interactions, or clinical protocols that are not part of the ZOL website content.
For example, a MedQA question like "What is the first-line treatment for community-acquired pneumonia?" expects clinical guideline knowledge. The ZOL system would correctly respond with a safety refusal or a navigational redirect to the Pneumology department — which would be scored as "incorrect" by MedQA metrics despite being the appropriate system behaviour.
9.2 Task Mismatch: Navigation vs. Clinical Decision Support
The ZOL system is navigational and informational, not a clinical decision support tool. Its primary task is answering queries like:
- "Which department handles heart problems?" (entity lookup)
- "How do I prepare for a colonoscopy?" (patient guide retrieval)
- "Which doctors work at the Oncology department?" (entity relationship traversal)
These tasks have no equivalent in MedQA, PubMedQA, or MMLU-Medical, which test medical reasoning, evidence synthesis, and clinical judgement. Evaluating a hospital navigation system on clinical reasoning benchmarks is analogous to evaluating a library catalogue system on reading comprehension — it measures the wrong capability.
9.3 Language and Domain Specificity
The ZOL system operates primarily in Dutch on Belgian hospital content. None of the major medical QA benchmarks provide Dutch-language evaluation sets. While MIRACL (Zhang et al., 2023) includes Dutch retrieval tasks, it does not cover medical QA. The closest applicable benchmark is MTEB-NL for retrieval quality (Layer 1 of our evaluation), which evaluates the embedding model on general Dutch information retrieval.
9.4 The Appropriate Evaluation Strategy
Thakur et al. (2021) demonstrated that domain-specific evaluation is essential because generic benchmarks systematically overestimate or underestimate system quality for specialised use cases. Following this principle, the ZOL system uses a three-layer evaluation architecture (described in Section 10) that combines external benchmarks for component validation with domain-specific benchmarks for system-level quality measurement.
10. Three-Layer Evaluation Architecture
The ZOL evaluation framework follows a layered approach that addresses the limitations of generic benchmarks while maintaining scientific rigour through external reference points.
10.1 Layer 1: MTEB-NL / BEIR-NL — Embedding Model Validation
Purpose: Validate the embedding model choice (BGE-M3) against published Dutch retrieval leaderboards.
Method: The mteb_nl_benchmark.py runner evaluates BGE-M3 on standardised MTEB retrieval tasks including Dutch content. This provides an external, reproducible reference point for the embedding model's retrieval capability independent of the ZOL domain.
Key metrics: NDCG@10, MRR, Recall@100 — aggregated across available Dutch retrieval tasks.
Measured result: BGE-M3 achieves an MTEB-NL retrieval score of 60.0, positioning it as the strongest open-source multilingual model available for local inference via Ollama.
Limitation: General Dutch retrieval does not measure medical domain performance. This layer validates the foundation (embedding quality) but not the application (hospital search).
10.2 Layer 2: Domain-Specific ZOL Retrieval Benchmark
Purpose: Measure retrieval quality for hospital-specific queries using a curated test set with known expected source URLs.
Method: A benchmark of 50 queries across 10 query types evaluates whether the retrieval pipeline (vector + BM25 + knowledge graph, fused via RRF, reranked via cross-encoder) returns the correct hospital pages for each query. Query types include:
| Type | Count | Description |
|---|---|---|
| entity_lookup | 5 | Doctor, department, campus lookups |
| condition_navigation | 5 | Symptom/condition to department routing |
| multi_hop | 5 | Multi-entity relationship chains |
| practical_info | 5 | Visiting hours, parking, appointments |
| rare_condition | 5 | Less common diseases and conditions |
| treatment_lookup | 5 | Treatment and procedure information |
| multilingual | 5 | Queries in English, French, Turkish |
| typo_tolerance | 5 | Queries with common spelling errors |
| complex_multi_hop | 5 | 3-4 hop chains across multiple entities |
| disambiguation | 5 | Ambiguous queries mapping to multiple departments |
Key metrics: Recall@5, Recall@10, MRR, NDCG@10, Precision@5 — computed per query type and aggregated.
Why this matters: This layer measures what generic benchmarks cannot — whether the system retrieves the right hospital content for the specific types of queries real patients ask. The per-type breakdown identifies which query categories need improvement (e.g., rare conditions may have lower recall than entity lookups).
10.3 Layer 3: End-to-End RAG Evaluation
Purpose: Measure full pipeline quality from query to generated answer, including retrieval, context assembly, generation, citation, and safety.
Method: A golden evaluation set of 271 questions across 21 intent categories is evaluated using:
- Entity recall: Do generated answers mention the correct entities (departments, doctors, conditions)?
- Pass rate: Does the answer correctly address the query intent?
- DeepEval metrics: FaithfulnessMetric, AnswerRelevancyMetric for LLM-as-judge quality
- Safety refusal rate: Are medical advice requests correctly refused?
- Citation accuracy: Do source citations correspond to actual retrieved content?
- Bootstrap confidence intervals: 95% CIs via 10,000 bootstrap iterations (W1-1)
Measured results: 98.9% pass rate, 0.936 entity recall, zero safety incidents.
10.4 How the Layers Complement Each Other
The three layers form a pyramid of evaluation scope:
Layer 3: End-to-End RAG
(302 golden questions, 21 intents)
/ Full pipeline: retrieval → generation \
/ Entity recall, pass rate, safety \
─────────────────────────────────────────────
Layer 2: Domain Retrieval Benchmark
(50 queries, 10 types, URL-level matching)
/ Retrieval pipeline isolation test \
/ Recall@k, MRR, NDCG@10 per query type \
──────────────────────────────────────────────────
Layer 1: MTEB-NL External Benchmark
(Standard Dutch retrieval tasks, published scores)
/ Embedding model validation \
/ External reproducibility, model comparison \
──────────────────────────────────────────────────────
- Layer 1 validates the component (embedding model) against external baselines
- Layer 2 validates the retrieval system against domain-specific ground truth
- Layer 3 validates the complete pipeline including generation quality and safety
A failure at Layer 1 (poor embedding model) would propagate to Layers 2 and 3. A failure at Layer 2 (retrieval misses) might not appear at Layer 3 if the LLM compensates — which is why isolated retrieval measurement is essential. A failure at Layer 3 (poor generation despite good retrieval) indicates generation-layer issues rather than retrieval problems.
This layered approach aligns with the BEIR methodology (Thakur et al., 2021) recommendation that retrieval evaluation should be separated from downstream task evaluation, and extends it with domain-specific adaptation.
11. Is This a Best-in-Class Architecture?
What "Best-in-Class" Means
For a production medical RAG system deployed in a hospital context, best-in-class means:
- Retrieval quality comparable to or exceeding published benchmarks for multilingual medical retrieval
- Safety guarantees validated through systematic adversarial testing
- Evaluation rigour sufficient for academic publication or regulatory submission
- Operational robustness with graceful degradation, monitoring, and alerting
- Architectural alignment with current SOTA RAG patterns documented in peer-reviewed surveys
Honest Assessment
The ZOL system is best-in-class for its operational constraints (local inference, single-instance deployment, zero external API dependencies for embeddings, Dutch language requirement). Within these constraints, the architecture makes excellent decisions:
- BGE-M3 is the strongest available model on Ollama with Dutch support
- The 12-layer safety architecture exceeds most published medical RAG systems
- Contextual embeddings with page summaries implement the full Anthropic pattern
- The frozen taxonomy approach prevents the "noisy graph" problem that plagues naive KG-RAG systems
- The 5-tier LLM routing efficiently allocates model capacity
The gap to absolute best-in-class has narrowed significantly following the Wave 1-4 improvements. The most significant remaining gaps are:
| Gap | Severity | Impact |
|---|---|---|
| No user study | Critical | Improved metrics ≠ improved patient experience |
| No inter-annotator agreement | Significant | Ground truth quality unvalidated |
| No learned sparse retrieval (SPLADE) | Medium | 5-15% potential retrieval improvement |
| No ColBERT/late interaction | Medium | Under-utilising BGE-M3 multi-vector capability |
| No Self-RAG | Medium | No self-critique during generation |
| No graph reasoning | Medium | Limited to lookup; cannot discover indirect relationships |
| No domain-adapted embedding | Medium | General-purpose model for specialised domain |
Previously critical gaps now addressed:
No external benchmark evaluation→ MTEB-NL/BEIR-NL evaluated (W1-4)No attribution verification→ Implemented (W1-3)No context filtering→ FILCO implemented, feature-flagged (W2-1)No adversarial red-teaming→ 40-case harness (W3-1) + guardrails model (W3-2)No statistical significance→ Bootstrap confidence intervals (W1-1)No CRAG→ Ternary quality gate with refinement (W4-2)No adaptive retrieval→ Intent-driven strategy routing (W4-1)
12. Roadmap to Best-in-Class
The following roadmap prioritises improvements by impact-to-effort ratio, grouped into three horizons:
Horizon 1: Evaluation Rigour — ✅ Completed (Wave 1)
| Item | Status | Reference |
|---|---|---|
| External benchmark evaluation on MTEB-NL / BEIR-NL | ✅ W1-4 | BGE-M3 score: 60.0 |
| Bootstrap confidence intervals on evaluation metrics | ✅ W1-1 | 10,000 bootstrap iterations |
| Safety pipeline FP/FN rate measurement | ✅ W1-2 | safety_evaluation.py |
| Attribution verification service | ✅ W1-3 | NLI-based entailment scoring |
| Inter-annotator agreement for golden questions | ❌ Open | Requires 2 domain experts |
Horizon 2: Retrieval Quality — ✅ Mostly Completed (Waves 2 + 4)
| Item | Status | Reference |
|---|---|---|
| FILCO-style context filtering at query time | ✅ W2-1 | Feature-flagged (context_filter_enabled) |
| Retrieval confidence scoring (calibrated abstention) | ✅ W2-2 | RetrievalConfidenceScorer |
| Domain-specific retrieval benchmark (200+ queries) | ✅ W2-3 | retrieval_benchmark.py |
| Adaptive retrieval (intent-driven strategy routing) | ✅ W4-1 | vector_only for navigational |
| Corrective RAG (ternary quality gate + refinement) | ✅ W4-2 | ADR-0038, feature-flagged |
| ColBERT retrieval mode in BGE-M3 | ❌ Open | Multi-vector retrieval |
| SNOMED CT integration (ADR-0016) | ✅ Phase C | 356K concepts, synonym expansion, FINDING_SITE routing, graph enrichment, 15/15 golden questions |
Horizon 3: Safety Hardening — ✅ Completed (Wave 3)
| Item | Status | Reference |
|---|---|---|
| Systematic red-teaming harness | ✅ W3-1 | 40 adversarial test cases |
| Guardrails model (Llama Guard 3) | ✅ W3-2 | Feature-flagged (guardrails_enabled) |
| Anomaly threshold validation (ROC analysis) | ✅ W3-3 | anomaly_threshold_validation.py |
Horizon 4: Remaining Gaps (Future Work)
| Item | Effort | Impact | Reference |
|---|---|---|---|
| User study (50+ patients, task-based evaluation) | 4 weeks | Validates real-world impact | Standard medical AI |
| Inter-annotator agreement (recruit 2 domain experts) | 1 week | Validates ground truth quality | Tsatsaronis et al. (2015) |
| Learned sparse retrieval (SPLADE or equivalent) | 3 weeks | +5-15% BEIR improvement | Formal et al. (2022) |
| ColBERT/late interaction retrieval mode | 2 weeks | Multi-vector retrieval | Khattab & Zaharia (2020) |
| Self-RAG (self-critique during generation) | 4 weeks | Improved faithfulness | Asai et al. (2024) |
| Agentic RAG with dynamic strategy selection | 6 weeks | Full Modular RAG | Singh et al. (2025) |
| Domain-adapted embedding fine-tuning | 3 weeks | +3-5% retrieval quality | Thakur et al. (2021) |
13. Conclusion
The ZOL Intelligent Search system represents a mature, production-grade Advanced RAG architecture with significant Modular RAG characteristics, making consistently good engineering trade-offs within its operational constraints. The 10-stage retrieval pipeline (now including adaptive strategy selection and CRAG quality gate), 12-layer safety architecture (now validated via red-teaming and guardrails model), and frozen taxonomy approach are architecturally sound and well-documented through 38 Architecture Decision Records.
Following the Wave 1-4 improvement programme, the system has closed 8 of the 11 originally identified gaps:
| Wave | Implemented | Key Achievement |
|---|---|---|
| W1 | Evaluation rigour | External benchmarks, bootstrap CIs, attribution verification, safety FP/FN |
| W2 | Retrieval quality | FILCO context filtering, confidence scoring, domain-specific benchmark |
| W3 | Safety hardening | Red-teaming harness, Llama Guard guardrails, ROC threshold validation |
| W4 | Adaptive pipeline | Intent-driven strategy routing, CRAG ternary quality gate |
The system now implements 18 of 26 identified SOTA techniques (up from 10/18 in the initial assessment). The comparative table score has improved from 55% to 69% coverage.
The remaining critical path to demonstrable best-in-class status is no longer primarily evaluative but requires external validation: a user study with real patients (critical), inter-annotator agreement for golden questions (significant), and the remaining retrieval improvements (SPLADE, ColBERT, Self-RAG) that represent moderate engineering effort with demonstrated academic impact. The system has crossed the threshold from "appears strong" to "architecturally strong with partial empirical evidence" — the next step is "demonstrably strong through external validation."
References
- Ammann, P. J. L., et al. (2025). Question decomposition for retrieval-augmented generation. ACL 2025 SRW. https://arxiv.org/abs/2507.00355
- Anthropic. (2024). Introducing contextual retrieval. https://www.anthropic.com/news/contextual-retrieval
- Asai, A., et al. (2024). Self-RAG: Learning to retrieve, generate, and critique through self-reflection. ICLR 2024. https://arxiv.org/abs/2310.11511
- Bruch, S., et al. (2023). An analysis of fusion functions for hybrid retrieval. ACM TOIS. https://doi.org/10.1145/3596512
- Chen, J., et al. (2024). BGE M3-Embedding. https://arxiv.org/abs/2402.03216
- Cho, J. & Garcia-Molina, H. (2003). Estimating frequency of change. ACM TOIT. https://doi.org/10.1145/857166.857170
- Cormack, G. V., et al. (2009). Reciprocal rank fusion. SIGIR 2009. https://doi.org/10.1145/1571941.1572114
- Edge, D., et al. (2024). From local to global: A graph RAG approach. Microsoft Research.
- Fan, W., et al. (2024). A survey on RAG meeting LLMs. https://arxiv.org/abs/2405.06211
- Formal, T., et al. (2022). From distillation to hard negative sampling: Making sparse neural IR models more effective. SIGIR 2022. https://arxiv.org/abs/2205.04733
- Gao, T., et al. (2023). Enabling large language models to generate text with citations. EMNLP 2023. https://arxiv.org/abs/2305.14627
- Gao, Y., et al. (2024). Retrieval-augmented generation for LLMs: A survey. https://arxiv.org/abs/2312.10997
- Günther, M., et al. (2024). Late chunking. https://arxiv.org/abs/2409.04701
- Jeong, S., et al. (2024). Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. NAACL 2024. https://arxiv.org/abs/2403.14403
- Jiang, H., et al. (2023). LLMLingua: Compressing prompts for accelerated inference of large language models. EMNLP 2023. https://arxiv.org/abs/2310.05736
- Karpukhin, V., et al. (2020). Dense passage retrieval. EMNLP 2020. https://arxiv.org/abs/2004.04906
- Khattab, O. & Zaharia, M. (2020). ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. SIGIR 2020. https://arxiv.org/abs/2004.12832
- Kusupati, A., et al. (2022). Matryoshka representation learning. NeurIPS 2022. https://arxiv.org/abs/2205.13147
- Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS 2020. https://arxiv.org/abs/2005.11401
- Liao, H., et al. (2024). AmpleGCG-Plus. https://arxiv.org/abs/2410.22143
- Malkov, Y. A. & Yashunin, D. A. (2018). Efficient and robust ANN search using HNSW graphs. IEEE TPAMI. https://doi.org/10.1109/TPAMI.2018.2889473
- Min, S., et al. (2019). Multi-hop reading comprehension through question decomposition and rescoring. ACL 2019.
- Nielsen, J. (1993). Usability engineering. Academic Press.
- Nogueira, R. & Cho, K. (2019). Passage re-ranking with BERT. https://arxiv.org/abs/1901.04085
- Olston, C. & Najork, M. (2010). Web crawling. FnTIR. https://doi.org/10.1561/1500000017
- Patel, D., et al. (2025). A framework to assess clinical safety and hallucination rates of LLMs. npj Digital Medicine. https://www.nature.com/articles/s41746-025-01670-7
- Peng, B., et al. (2025). Retrieval-augmented generation with graphs (GraphRAG). https://arxiv.org/abs/2501.00309
- Perez, E., et al. (2022). Red teaming language models with language models. EMNLP 2022. https://arxiv.org/abs/2202.03286
- Pradeep, R., et al. (2023). RankVicuna: Zero-shot listwise document reranking with open-source LLMs. https://arxiv.org/abs/2309.15088
- Reimers, N. & Gurevych, I. (2019). Sentence-BERT. EMNLP 2019. https://arxiv.org/abs/1908.10084
- Ren, J., et al. (2023). Self-evaluation improves selective generation in LLMs. https://arxiv.org/abs/2312.09300
- Robertson, S. & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. FnTIR. https://doi.org/10.1561/1500000019
- Sarmah, B., et al. (2024). HybridRAG. https://arxiv.org/abs/2408.04948
- Singh, A., et al. (2025). Agentic retrieval-augmented generation: A survey. https://arxiv.org/abs/2501.09136
- Thakur, N., et al. (2021). BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. NeurIPS 2021. https://arxiv.org/abs/2104.08663
- Trivedi, H., et al. (2023). Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. ACL 2023. https://arxiv.org/abs/2212.10509
- Vake, L., et al. (2025). HyPE-RAG. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335
- Wang, Z., et al. (2023). Learning to filter context for RAG (FILCO). https://arxiv.org/abs/2311.08377
- Yan, S., et al. (2024). Corrective retrieval augmented generation. https://arxiv.org/abs/2401.15884
- Zhang, X., et al. (2023). MIRACL: A multilingual retrieval dataset covering 18 diverse languages. TACL. https://arxiv.org/abs/2210.09984
- Zheng, L., et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
- Zou, A., et al. (2023). Universal and transferable adversarial attacks on aligned language models. https://arxiv.org/abs/2307.15043