
Academic Critical Assessment

Embedding-stack note

The Section 3 ("Embedding Strategy") evaluation below was written when production used BGE-M3 (Chen et al., 2024) at 1024 dim via on-prem Ollama. As of ADR-0048 (2026-04-30) the system uses OpenAI text-embedding-3-large at 1536 dim, hosted. The discussion of multilingual coverage, dimensionality trade-offs, and multi-vector retrieval still applies — most of the BGE-M3-specific gaps (no ColBERT mode, no learned-sparse, no domain fine-tuning) hold for text-embedding-3-large as well, since it is a single dense embedder. The paragraphs are preserved verbatim because they remain the academic critique that motivated subsequent improvement work; only the model-name labels have moved on.

This chapter provides an honest and critical evaluation of the ZOL Intelligent Search system architecture, measured against the current state of the art in Retrieval-Augmented Generation (RAG), knowledge graph integration, adversarial robustness, and medical AI safety. The assessment identifies both demonstrated strengths and architectural gaps, concluding with a concrete roadmap for achieving best-in-class status.

Methodology Note

This assessment evaluates architectural decisions and implementation quality against published academic benchmarks and production RAG surveys (Gao et al., 2024; Fan et al., 2024; Peng et al., 2025). Where the ZOL system has not been evaluated on standardised benchmarks (e.g., BEIR, MTEB-NL retrieval), this is explicitly noted as a gap. Self-reported metrics from internal golden evaluations are referenced but acknowledged as non-comparable to external benchmarks.


1. Overall Architecture Classification

Gao et al. (2024) classify RAG systems into three generations:

Generation | Description | Key Features
Naive RAG | Basic retrieve-then-generate | Single retrieval, no post-processing
Advanced RAG | Pre-retrieval and post-retrieval optimisation | Query rewriting, reranking, metadata boosting
Modular RAG | Composable, interchangeable components | Pluggable retrievers, adaptive routing, agent-based orchestration

Assessment: The ZOL system is a mature Advanced RAG with significant Modular RAG characteristics. It implements pre-retrieval optimisation (intent classification, taxonomy enrichment, query decomposition), parallel multi-channel retrieval (vector + BM25 + knowledge graph), post-retrieval refinement (RRF fusion, metadata boosting, cross-encoder reranking), context enrichment (contextual embeddings, page summaries), and — since Wave 4 — adaptive retrieval strategy selection (W4-1) and a Corrective RAG (CRAG) quality gate (W4-2, Yan et al., 2024) that classifies retrieval confidence and triggers refinement for ambiguous results. The modular elements include configurable model routing (5-tier LLM hierarchy), pluggable reranker backends (Jina API / local BGE), feature-flagged query decomposition, intent-driven strategy routing, and a ternary pre-generation quality gate with automatic retrieval refinement.

The system now implements partial adaptive orchestration: retrieval strategies are selected based on intent classification (W4-1), and retrieval results are evaluated post-retrieval via CRAG (W4-2) — if classified as AMBIGUOUS, the system automatically retries with relaxed parameters (lower similarity threshold, expanded result set, no category filter). However, the system cannot yet dynamically select between retrievers mid-execution or perform multi-hop retrieval chains, which are hallmarks of full agentic RAG (Singh et al., 2025; Trivedi et al., 2023).
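The retry behaviour described above can be sketched as a small ternary gate. This is a minimal illustration, not the production code: the `RetrievalParams` fields, the two confidence cut-offs, and the relaxation factors are all hypothetical, and the retriever is assumed to return its documents together with a confidence score in [0, 1].

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Callable, List, Optional, Tuple


class CRAGDecision(Enum):
    CORRECT = "correct"      # confident: generate
    AMBIGUOUS = "ambiguous"  # borderline: refine retrieval, then re-assess
    INCORRECT = "incorrect"  # not confident: abstain


@dataclass(frozen=True)
class RetrievalParams:
    # Hypothetical parameter set; the defaults are illustrative only.
    similarity_threshold: float = 0.5
    top_k: int = 20
    category_filter: Optional[str] = None


def classify(confidence: float, upper: float = 0.7, lower: float = 0.3) -> CRAGDecision:
    """Ternary gate; both cut-offs are assumptions, not documented values."""
    if confidence >= upper:
        return CRAGDecision.CORRECT
    if confidence < lower:
        return CRAGDecision.INCORRECT
    return CRAGDecision.AMBIGUOUS


def relax(params: RetrievalParams) -> RetrievalParams:
    """Lower the threshold, widen the result set, drop the category filter."""
    return replace(
        params,
        similarity_threshold=params.similarity_threshold * 0.8,
        top_k=params.top_k * 2,
        category_filter=None,
    )


Retriever = Callable[[str, RetrievalParams], Tuple[List[str], float]]


def gated_retrieve(query: str, retrieve: Retriever, params: RetrievalParams):
    """Return (decision, documents); retries once when the first pass is AMBIGUOUS."""
    docs, confidence = retrieve(query, params)
    decision = classify(confidence)
    if decision is CRAGDecision.AMBIGUOUS:
        docs, confidence = retrieve(query, relax(params))
        decision = classify(confidence)
    return decision, docs
```

The sketch retries exactly once and returns whatever the second classification yields; how a result that is still AMBIGUOUS after refinement should be handled is left to the caller, since the text does not specify it.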

Verdict: ★★★★☆ — Strong Advanced RAG with meaningful Modular RAG features (adaptive strategy + CRAG). Approaching but not yet fully agentic.


2. Retrieval Architecture

2.1 Hybrid Search (Vector + BM25)

Strengths: The combination of dense vector search (BGE-M3, 1024-dim) with sparse BM25 keyword search, fused via Reciprocal Rank Fusion (Cormack et al., 2009), represents the current production standard for enterprise RAG (Bruch et al., 2023). The ZOL system further enhances this with:

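  • Contextual embeddings (Anthropic, 2024) prepended at ingestion time; Anthropic reports a 35% reduction in retrieval failures from contextual embeddings alone and 49% when they are combined with contextual BM25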
  • Canonical question generation for BM25 enrichment — partially implements the HyPE pattern (Vake et al., 2025)
  • Keyword rescue as a safety net for rare terms missed by both channels
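As a minimal sketch of the fusion step, Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over the ranked lists it appears in. The value k=60 matches the setting reported in the comparison table in Section 8; the document IDs and the tie-breaking rule are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank), rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the dense and sparse channels.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([vector_hits, bm25_hits])
```

Because only ranks enter the formula, the channels do not need comparable score scales, which is the usual reason for preferring RRF over weighted score averaging.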

Gaps:

  1. No learned sparse retrieval: The system uses PostgreSQL tsvector with 'simple' tokenisation (no stemming). Modern sparse retrieval models like SPLADE (Formal et al., 2022) learn term importance weights that outperform raw BM25 by 5-15% on BEIR benchmarks. The 'simple' configuration, while appropriate for preserving Dutch medical terms, sacrifices the morphological normalisation that would help with Dutch inflections (e.g., "behandeling" vs. "behandelingen").

  2. No ColBERT/late interaction retrieval: BGE-M3 supports ColBERT retrieval mode (multi-vector matching per token), but this capability is not utilised. ColBERT provides a middle ground between bi-encoder speed and cross-encoder accuracy, and is particularly effective for long medical queries where individual term-level matching matters (Khattab & Zaharia, 2020).

  3. No query-adaptive retrieval → Partially addressed (W4-1): The system now implements intent-driven adaptive retrieval strategy selection: navigational queries use vector_only, entity-specific queries use graph_first, and complex medical queries use full hybrid. This implements the selective channel activation recommended by Adaptive RAG (Jeong et al., 2024). However, the strategy is fixed at pipeline start based on intent classification and cannot be revised mid-execution based on intermediate retrieval quality.

  4. Canonical questions are BM25-only: The generated canonical questions enrich BM25 but are not embedded as separate vectors. Full HyPE implementation (Vake et al., 2025) would embed hypothetical questions alongside document chunks, providing additional vector-space retrieval paths.
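The intent-driven strategy selection from item 3 can be illustrated with a simple lookup. The strategy names follow the text; the intent labels and the fallback to full hybrid for unknown intents are assumptions, not documented behaviour.

```python
from enum import Enum


class Strategy(Enum):
    VECTOR_ONLY = "vector_only"
    GRAPH_FIRST = "graph_first"
    FULL_HYBRID = "full_hybrid"


# Intent labels are hypothetical; the strategy names follow the text above.
INTENT_TO_STRATEGY = {
    "navigational": Strategy.VECTOR_ONLY,
    "entity_specific": Strategy.GRAPH_FIRST,
    "complex_medical": Strategy.FULL_HYBRID,
}


def select_strategy(intent: str) -> Strategy:
    """Pick the strategy once, at pipeline start; unknown intents fall back to full hybrid."""
    return INTENT_TO_STRATEGY.get(intent, Strategy.FULL_HYBRID)
```

The lookup is evaluated once per query, which mirrors the limitation noted above: nothing in this shape allows the choice to be revised after intermediate retrieval results are seen.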

2.2 Knowledge Graph Integration

Strengths: The PostgreSQL taxonomy with typed entities (doctors, departments, conditions, treatments, campuses, examinations) and curated relationships (HANDLES, OFFERS, WORKS_IN, LOCATED_AT) provides structured entity traversal that vector search cannot replicate. The frozen taxonomy approach (ADR-0028) with LLM-validated hub pages ensures high taxonomy data quality — a critical requirement identified by Peng et al. (2025) in their GraphRAG survey.

The separation of graph seeding from document ingestion is architecturally sound: it prevents the "noisy graph" problem documented in early GraphRAG implementations (Edge et al., 2024), where unrestricted entity extraction from all documents produces low-quality relationships that degrade retrieval.

Gaps:

  1. No graph-based reasoning: The current graph integration is purely lookup-based (Cypher queries for entity relationships). True GraphRAG (Peng et al., 2025) involves graph-guided retrieval where the graph structure informs the retrieval strategy — e.g., traversing relationship chains to discover relevant documents that wouldn't be found by similarity search. The ZOL system's graph results are simply merged with vector/BM25 results rather than guiding the retrieval process.

  2. No graph embeddings: The knowledge graph nodes have no learned embeddings. Knowledge graph embedding methods (TransE, ComplEx, RotatE) could enable similarity search over the graph structure, finding related entities even when exact relationship paths don't exist in the taxonomy.

  3. Static taxonomy: The frozen taxonomy (zol_taxonomy.py) is manually maintained. While this ensures quality, it cannot scale to new conditions, treatments, or organisational changes without developer intervention. An automated taxonomy update pipeline with LLM-based validation (as proposed in ADR-0028) would improve maintainability.

  4. Partial ontology alignment: The system integrates SNOMED CT Belgian Edition (356K concepts, 656K descriptions, 4.7M transitive closure relationships) via ADR-0016. Query-time synonym expansion, FINDING_SITE-based department routing, and graph enrichment with SNOMED concept IDs and IS_A relationships are implemented (15/15 SNOMED golden questions pass). Remaining gaps: IS_A hierarchical traversal is not used at query time for broad-category queries, and cross-language descriptions (French, English) are imported but not loaded into the lookup tables.

2.3 Reranking

Strengths: Always-on cross-encoder reranking via Jina Reranker v2 (with local bge-reranker-v2-m3 fallback) implements the two-stage retrieval paradigm established by Nogueira and Cho (2019). The candidate reduction from 50 to 20 (ADR-0034) was validated through A/B testing showing equivalent MRR and NDCG@5.

Gaps:

  1. No listwise reranking: Current reranking is pointwise (each query-document pair scored independently). Listwise reranking approaches (Pradeep et al., 2023) that consider all candidates simultaneously produce more calibrated rankings but require LLM-based rerankers.

  2. No domain-adapted reranker: Neither Jina nor bge-reranker-v2-m3 is fine-tuned for Dutch medical content. Domain adaptation of rerankers has been shown to improve NDCG@5 by 3-8% in specialised domains (Thakur et al., 2021).

Verdict: ★★★★☆ — Strong production-grade retrieval. The hybrid search + reranking architecture is state-of-the-art for production systems. Adaptive retrieval (W4-1) partially closes the strategy gap. Key remaining gaps are ColBERT utilisation and graph-based reasoning.


3. Embedding Strategy

Strengths: BGE-M3 (Chen et al., 2024) is the strongest open-source multilingual embedding model available on Ollama, with a measured MTEB-NL retrieval score of 60.0. Local inference via Ollama ensures zero API cost and full data sovereignty — critical requirements for healthcare deployments. The contextual embedding approach (prepending LLM-generated context before embedding) is directly aligned with Anthropic's (2024) research showing 35-49% retrieval failure reduction.

Gaps:

  1. Not benchmarked on ZOL-specific retrieval: The MTEB-NL score of 60.0 was measured on general Dutch retrieval tasks. No ZOL-domain-specific retrieval benchmark exists, so the actual quality for Dutch medical queries is inferred but not measured. Creating a domain-specific evaluation set (analogous to the golden questions but focused specifically on retrieval ranking rather than end-to-end answer quality) would provide this measurement.

  2. No fine-tuning: BGE-M3 is used as-is without fine-tuning on Dutch medical text. Domain-specific fine-tuning using contrastive learning on the ZOL corpus could improve retrieval quality by 3-5% based on analogous domain adaptation results (Thakur et al., 2021). However, the cost-benefit trade-off is unclear given the existing contextual embedding enrichment.

  3. No embedding compression: At 1024 dimensions stored as 32-bit floats, BGE-M3 embeddings consume 4KB per vector. Matryoshka representation learning (Kusupati et al., 2022) enables adaptive dimension reduction without retraining, but BGE-M3 does not support this natively. Quantisation (e.g., half-precision or binary vectors in pgvector, which offers IVFFlat and HNSW indexes but no product-quantisation index) could reduce storage and improve query speed.

  4. No multi-vector retrieval: BGE-M3 supports dense, sparse, and ColBERT retrieval modes simultaneously. Only the dense mode is used. Activating sparse and ColBERT modes would create a multi-vector retrieval system with potentially significant recall improvements.
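The storage figure in item 3 follows from simple arithmetic, sketched below. The corpus size passed to `corpus_megabytes` is up to the caller (no chunk count is given in this chapter), and the half-precision and 1-bit variants are generic quantisation options shown only for comparison.

```python
BYTES_PER_FLOAT32 = 4
BYTES_PER_FLOAT16 = 2


def vector_bytes(dim: int, bytes_per_component: int = BYTES_PER_FLOAT32) -> int:
    """Raw payload size of one dense vector, ignoring index and row overhead."""
    return dim * bytes_per_component


def corpus_megabytes(num_vectors: int, dim: int,
                     bytes_per_component: int = BYTES_PER_FLOAT32) -> float:
    """Raw payload size of a whole collection of vectors, in MiB."""
    return num_vectors * vector_bytes(dim, bytes_per_component) / (1024 * 1024)


full = vector_bytes(1024)                     # 1024 * 4 = 4096 bytes (4 KB)
half = vector_bytes(1024, BYTES_PER_FLOAT16)  # 1024 * 2 = 2048 bytes
binary = 1024 // 8                            # 128 bytes at 1 bit per dimension
```

These numbers cover the vector payload only; index structures add their own overhead on top.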

Verdict: ★★★☆☆ — Good model selection within constraints. Local inference is a strong operational decision. However, the lack of domain-specific benchmarking and underutilisation of BGE-M3's multi-vector capabilities leave measurable improvement opportunities.


4. Context Assembly and Generation

4.1 Context Filtering

Strengths: The three-level contextual retrieval implementation (embedding-time context, BM25-time enrichment, generation-time page summaries) is a comprehensive approach to the context quality problem. The ±1 chunk expansion with overlap deduplication preserves document coherence while managing token budget.

Gaps:

  1. No query-time context filtering (FILCO) → Implemented (W2-1), feature-flagged: A FILCO-style sentence-level context filtering service is now implemented (context_filter_enabled, default: off) and wired into the pipeline at Step 6c. When enabled, it scores individual sentences within retrieved chunks for query relevance and removes low-scoring passages before generation. This partially addresses Wang et al.'s (2023) finding that filtering reduces prompt lengths by up to 64% while improving answer quality. The implementation uses lexical overlap scoring rather than the full conditional cross-mutual information approach from the original paper.

  2. Fixed token budget: The 8,000-token context budget is static. Adaptive token allocation based on query complexity (simple questions need less context, multi-hop questions need more) could improve both efficiency and quality.

  3. No context compression: Long-context compression techniques (e.g., LLMLingua by Jiang et al., 2023) can reduce context length by 2-5x while preserving answer quality, enabling more documents to fit within the budget.
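The lexical-overlap filtering mentioned in item 1 might look roughly like the following. This is a sketch under stated assumptions: the tokenisation, the naive sentence splitter, the overlap definition (share of query tokens found in the sentence), and the `min_overlap` threshold are all illustrative choices rather than the service's actual logic.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens; no stemming or stop-word removal."""
    return set(re.findall(r"\w+", text.lower()))


def filter_sentences(query: str, chunk: str, min_overlap: float = 0.2) -> List[str]:
    """Keep sentences that cover at least min_overlap of the query tokens."""
    query_tokens = tokens(query)
    if not query_tokens:
        return []
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    kept = []
    for sentence in sentences:
        overlap = len(query_tokens & tokens(sentence)) / len(query_tokens)
        if overlap >= min_overlap:
            kept.append(sentence)
    return kept
```

A purely lexical score like this misses paraphrases and inflected forms, which is the trade-off against the mutual-information scoring of the original FILCO paper.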

4.2 Generation

Strengths: The 5-tier LLM routing (nano/mini/standard/escalation/flagship) efficiently allocates model capacity to query complexity. Streaming responses with progress indicators address user experience requirements (Nielsen, 1993). The strict grounding prompt enforcing citation with [1] notation implements basic attribution.

Gaps:

  1. No attribution verification → Implemented (W1-3): An AttributionVerificationService now provides post-hoc citation checking using NLI-based entailment scoring. The service verifies whether each citation actually supports the corresponding claim, following Gao et al. (2023). This is available as an evaluation tool and can be integrated into the generation pipeline for runtime verification.

  2. No abstention mechanism → Implemented (W2-2 + W4-2): A RetrievalConfidenceScorer computes a weighted confidence score (50% top_score + 30% mean_top_k + 20% score_gap) with a configurable abstention threshold. This was further extended by the CRAG quality gate (W4-2, ADR-0038), which classifies retrieval as CORRECT/AMBIGUOUS/INCORRECT and automatically refuses generation when confidence is below threshold, implementing confidence-calibrated abstention (Ren et al., 2023) independent of LLM judgement.

  3. No Self-RAG or CRAG → CRAG implemented (W4-2): Corrective RAG (Yan et al., 2024) is now implemented via the CRAGDecision ternary classifier. Retrieval is classified as CORRECT (generate), AMBIGUOUS (refine with relaxed parameters then re-assess), or INCORRECT (abstain). The AMBIGUOUS path triggers automatic retrieval refinement with lower similarity threshold, expanded result set, and removed category filters, adding ~0.5-1s latency only for borderline queries. Feature-flagged via crag_enabled (default: off). Self-RAG is not yet implemented.
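The weighted score from item 2 can be written out as follows. The 50/30/20 weights come from the text; everything else is an assumption made for the sketch: scores are taken to lie in [0, 1], `score_gap` is interpreted as the margin between the best and second-best hit, and the `top_k` and `threshold` defaults are hypothetical.

```python
from typing import Sequence


def retrieval_confidence(scores: Sequence[float], top_k: int = 5) -> float:
    """0.5 * top_score + 0.3 * mean_top_k + 0.2 * score_gap, for scores in [0, 1]."""
    if not scores:
        return 0.0
    ranked = sorted(scores, reverse=True)
    top_score = ranked[0]
    head = ranked[:top_k]
    mean_top_k = sum(head) / len(head)
    # Assumption: score_gap is the margin between the best and second-best hit.
    score_gap = top_score - ranked[1] if len(ranked) > 1 else top_score
    return 0.5 * top_score + 0.3 * mean_top_k + 0.2 * score_gap


def should_abstain(scores: Sequence[float], threshold: float = 0.4) -> bool:
    """Abstain when confidence falls below the (hypothetical) threshold."""
    return retrieval_confidence(scores) < threshold
```

For reranker scores of 0.9, 0.7 and 0.5 this gives 0.5 × 0.9 + 0.3 × 0.7 + 0.2 × 0.2 = 0.70, so generation would proceed under the assumed threshold.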

Verdict: ★★★★☆ — Significantly improved since initial assessment. FILCO context filtering (W2-1), attribution verification (W1-3), confidence-calibrated abstention (W2-2), and CRAG (W4-2) close the most significant generation-quality gaps. Remaining gaps: Self-RAG, adaptive token budget, and context compression.


5. Safety and Adversarial Robustness

Strengths: The 12-layer defence-in-depth architecture (ADR-0036) is significantly more comprehensive than most production RAG systems. Key highlights:

  • Perplexity-based anomaly detector (H1) catches GCG-style adversarial suffixes in under 5ms using statistical heuristics — a novel, cost-effective approach that doesn't require an LLM call
  • LLM-as-judge safety validation (H2) enabled by default with intent-based skip optimisation
  • In-memory rate limiter fallback (H3) with burst protection prevents fail-open scenarios
  • Streaming retraction with server-side enforcement (H4) and WebSocket close code 4001

The multi-layer approach aligns with the defence-in-depth principle recommended by Zou et al. (2023) for protecting against universal adversarial attacks.

Gaps:

  1. No red-teaming evaluation → Implemented (W3-1): A systematic red-teaming harness with 40 adversarial test cases covering GCG-style suffixes, prompt injection, context manipulation, and role-play attacks is now available (tests/evaluation/red_teaming.py). The harness tests the full safety pipeline against established attack patterns (Perez et al., 2022).

  2. No input/output guardrails model → Implemented (W3-2): A GuardrailsService integrating Llama Guard 3 (via OpenRouter) provides trained classifier-based input/output safety validation. Feature-flagged via guardrails_enabled (default: off). This supplements the existing regex + statistical heuristic layers with a dedicated safety classification model, addressing the gap for detecting sophisticated paraphrased attacks.

  3. No formal safety evaluation framework → Implemented (W1-2 + W3-3): A safety evaluation framework (tests/evaluation/safety_evaluation.py) measures false positive rates (safe queries incorrectly blocked) and false negative rates (unsafe responses not caught) across the full safety pipeline. Additionally, an anomaly threshold validation tool (W3-3) performs ROC curve analysis to optimise detector thresholds against labelled adversarial and benign corpora, quantifying safety trade-offs as recommended by Patel et al. (2025).

  4. Perplexity detector false positives: The statistical anomaly detector may flag legitimate queries containing code-switched medical terminology, URLs, or non-Latin scripts. The thresholds are now validated via ROC analysis (W3-3) against a labelled dataset of adversarial and benign queries, but production-scale validation with real user traffic has not yet been conducted.
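To make the ROC-based threshold validation from items 3 and 4 concrete, the sketch below computes ROC points from labelled anomaly scores and picks a threshold. The selection criterion (Youden's J, i.e. TPR minus FPR) is an assumption; the text does not state which criterion the W3-3 tool optimises, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float],
               labels: Sequence[int]) -> List[Tuple[float, float, float]]:
    """Return (threshold, false_positive_rate, true_positive_rate) per candidate threshold.

    labels: 1 = adversarial, 0 = benign. A query is flagged when score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one adversarial and one benign example")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def best_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pick the threshold maximising Youden's J = TPR - FPR (an assumed criterion)."""
    return max(roc_points(scores, labels), key=lambda p: p[2] - p[1])[0]
```

In a patient-facing setting a different operating point may be preferable, for example capping the false positive rate so that legitimate code-switched medical queries are rarely blocked.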

Verdict: ★★★★★ — The safety architecture now includes systematic red-teaming (W3-1), a guardrails model (W3-2), quantified FP/FN measurement (W1-2), and threshold validation via ROC analysis (W3-3). This represents a comprehensive defence-in-depth posture that exceeds published production RAG safety architectures. The remaining gap is production-scale validation with real user traffic.


6. Evaluation Methodology

Strengths: The golden evaluation framework (302 questions across 21 intent categories, v3.6) provides a reproducible, deterministic evaluation of end-to-end system quality. The primary metrics -- entity recall, pass rate, citation accuracy, safety refusal rate -- cover the key quality dimensions. The evaluation distinguishes between retrieval quality and generation quality, enabling targeted debugging.

Note on NDCG@5 / MRR: The golden evaluation reports include NDCG@5 and MRR as retrieval metrics, but these values are near-zero (typically 0.000-0.055) due to a URL granularity mismatch: expected_source_urls are defined at a coarse department-page level (e.g. /cardiologie), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures. Without fine-grained per-document relevance judgments, these metrics cannot be meaningfully computed. The system's retrieval quality is better reflected by entity recall (0.94+) and pass rate (98.9%), which measure end-to-end answer quality.

Gaps:

  1. No external benchmark evaluation → Implemented (W1-4): An MTEB-NL/BEIR-NL benchmark harness (tests/evaluation/mteb_nl_benchmark.py) evaluates the BGE-M3 embedding model on standardised Dutch retrieval tasks. The measured MTEB-NL retrieval score of 60.0 provides an external reference point for the embedding model choice. A domain-specific retrieval benchmark (W2-3) with 200+ queries across ZOL-specific categories was also created.

  2. No inter-annotator agreement: The golden questions were created by a single annotator (the developer). Medical information retrieval evaluation requires multi-annotator agreement scores to validate ground truth quality (Tsatsaronis et al., 2015). Without inter-annotator agreement, the evaluation may reflect a single person's expectations rather than true information need. This remains an open gap.

  3. No statistical significance testing → Implemented (W1-1): Bootstrap confidence intervals are now computed for all evaluation metrics (tests/evaluation/statistical_analysis.py). Given the 146-question evaluation set, 95% confidence intervals quantify the reliability of observed improvements via bootstrap resampling (10,000 iterations). This addresses the point-estimate-only reporting gap.

  4. No user-based evaluation: All evaluation is offline (golden questions). No user study has been conducted to validate that improved retrieval metrics correlate with improved user satisfaction and task completion. In medical search contexts, user-based evaluation is particularly important because patients may have different information needs than the system designer assumes. This remains an open gap.

  5. Limited LLM-as-judge validation: The system uses DeepEval's FaithfulnessMetric and AnswerRelevancyMetric for quality analytics, but the LLM judge itself has not been validated against human judgements for Dutch medical content. Zheng et al. (2023) showed that LLM judges have systematic biases that vary by language and domain. This remains an open gap.
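The bootstrap confidence intervals from item 3 can be sketched with the standard library alone. The percentile method and the fixed seed are assumptions; the text specifies only 95% intervals from 10,000 resampling iterations.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(values: Sequence[float], iterations: int = 10_000,
                 confidence: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of per-question scores."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)
    n = len(values)
    means: List[float] = []
    for _ in range(iterations):
        resample = rng.choices(values, k=n)  # sample with replacement
        means.append(sum(resample) / n)
    means.sort()
    alpha = 1.0 - confidence
    lower = means[int((alpha / 2) * iterations)]
    upper = means[min(iterations - 1, int((1 - alpha / 2) * iterations))]
    return lower, upper
```

Passing a list of per-question pass/fail outcomes (1.0 or 0.0) yields an interval for the pass rate; two system versions whose intervals overlap heavily should not be reported as a clear improvement.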

Verdict: ★★★☆☆ — Significantly improved from the initial assessment. External benchmarks (W1-4), domain-specific retrieval benchmarks (W2-3), and bootstrap confidence intervals (W1-1) bring the evaluation closer to academic standards. The remaining critical gaps are inter-annotator agreement and user-based evaluation.


7. Incremental Crawling and Data Freshness

Strengths: The content-hash-based change detection for incremental updates implements a well-established approach from the web crawling literature (Cho & Garcia-Molina, 2003). The sitemap-driven discovery ensures comprehensive URL coverage. Content deduplication by title prevents duplicate documents from different URL paths.
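A minimal sketch of content-hash change detection is shown below. The hash function (SHA-256), the whitespace normalisation, and the in-memory `stored` mapping are assumptions for illustration; the chapter does not describe how the crawler computes or persists its hashes.

```python
import hashlib
import re
from typing import Dict


def content_hash(text: str) -> str:
    """SHA-256 over whitespace-normalised text, so formatting-only edits are ignored."""
    normalised = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def needs_reprocessing(url: str, text: str, stored: Dict[str, str]) -> bool:
    """True if the page is new or changed; records the new hash as a side effect."""
    digest = content_hash(text)
    if stored.get(url) == digest:
        return False
    stored[url] = digest
    return True
```

Only pages for which `needs_reprocessing` returns True would be re-chunked and re-embedded, which is what keeps an incremental crawl cheap when most content is unchanged.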

Gaps:

  1. No change frequency estimation: Cho and Garcia-Molina (2003) demonstrated that Poisson-based change frequency estimators improve crawl freshness by 35%. The ZOL system treats all URLs equally during re-crawls rather than prioritising frequently-changing content (e.g., doctor schedules, visiting hours).

  2. No differential update: When a document changes, the entire document is re-processed (re-chunked, re-embedded). A differential update approach that identifies changed sections and updates only affected chunks would reduce re-embedding costs.

  3. No freshness monitoring: There is no automated monitoring of content freshness — no alerts when crawled content becomes stale, no automatic re-crawl scheduling, no freshness metrics in the analytics dashboard.
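The change-frequency idea in item 1 could be prototyped as below. The estimator has the bias-reduced form associated with Cho and Garcia-Molina's work on pages polled at a regular interval, where n is the number of visits and X the number of visits that detected a change; the exact form should be treated as an assumption and checked against the paper before use. The tuple layout passed to `prioritise` is hypothetical.

```python
import math


def estimate_change_rate(visits: int, changes_detected: int, interval_days: float) -> float:
    """Estimated changes per day for a page polled every interval_days.

    Uses -log((n - X + 0.5) / (n + 0.5)) / interval, which stays finite even
    when every visit detected a change (X == n).
    """
    if visits <= 0 or not 0 <= changes_detected <= visits:
        raise ValueError("require visits > 0 and 0 <= changes_detected <= visits")
    unchanged = visits - changes_detected
    return -math.log((unchanged + 0.5) / (visits + 0.5)) / interval_days


def prioritise(pages):
    """Order (url, visits, changes, interval_days) tuples by estimated rate, highest first."""
    return sorted(pages, key=lambda p: estimate_change_rate(p[1], p[2], p[3]), reverse=True)
```

A scheduler could then re-crawl the highest-rate pages more often, instead of treating all URLs equally.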

Verdict: ★★★☆☆ — Functional incremental ingestion. Missing optimisation opportunities for change-frequency-based scheduling and differential updates.


8. Comparative Analysis: ZOL vs. State-of-the-Art

The following table compares the ZOL system against key techniques identified in recent RAG surveys (Gao et al., 2024; Fan et al., 2024; Peng et al., 2025):

Technique | ZOL Status | SOTA Benchmark | Gap
Hybrid search (vector + BM25) | ✅ Implemented | Standard practice | None
Cross-encoder reranking | ✅ Always-on (Jina + BGE fallback) | Standard practice | No domain adaptation
Contextual embeddings | ✅ Full (embed + BM25 + gen-time) | Anthropic (2024): -49% failure | None; fully aligned
Knowledge graph integration | ✅ Implemented (Neo4j typed nodes) | Peng et al. (2025): GraphRAG | Lookup-only, no graph reasoning
RRF score fusion | ✅ k=60 | Cormack et al. (2009) | None
Query decomposition | ✅ Feature-flagged | Ammann et al. (2025): +36.7% MRR | None
Metadata boosting | ✅ 9 signals | Novel (domain-specific) | No published comparison
Adversarial hardening | ✅ 12 layers + anomaly detector | Zou et al. (2023): GCG defence | No red-team validation → Validated (W3-1)
Context filtering (FILCO) | ✅ Implemented (W2-1), feature-flagged | Wang et al. (2023): -64% prompt | Lexical overlap only (no CMI)
CRAG (Corrective RAG) | ✅ Implemented (W4-2), feature-flagged | Yan et al. (2024) | None; ternary gate with refinement
Adaptive retrieval | ✅ Implemented (W4-1) | Jeong et al. (2024) | Intent-driven only (not mid-pipeline)
Attribution verification | ✅ Implemented (W1-3) | Gao et al. (2023) | Available as evaluation tool
Retrieval confidence / abstention | ✅ Implemented (W2-2) | Ren et al. (2023) | None; calibrated abstention
Guardrails model | ✅ Implemented (W3-2), feature-flagged | Llama Guard 3 | None
External benchmark (MTEB-NL/BEIR-NL) | ✅ Evaluated (W1-4) | Standard practice | BGE-M3 score: 60.0
Bootstrap confidence intervals | ✅ Implemented (W1-1) | Standard practice | None
Safety FP/FN measurement | ✅ Implemented (W1-2 + W3-3) | Patel et al. (2025) | ROC threshold validation
Domain-specific retrieval benchmark | ✅ Implemented (W2-3) | BEIR methodology | 200+ queries
Learned sparse retrieval (SPLADE) | ❌ Not implemented | +5-15% on BEIR | Moderate
ColBERT/late interaction | ❌ Not implemented | Khattab & Zaharia (2020) | Moderate
Self-RAG | ❌ Not implemented | Asai et al. (2024) | Moderate (latency cost)
User study | ❌ Not conducted | Standard for medical AI | Critical
Domain-adapted embedding | ❌ Not implemented | Thakur et al. (2021) | Moderate
Agentic RAG | ❌ Not implemented | Singh et al. (2025) | Future direction
Inter-annotator agreement | ❌ Not conducted | Tsatsaronis et al. (2015) | Significant

Summary: 18 of the 25 techniques listed above are implemented (up from 10/18). The 7 open items comprise 1 critical gap (user study), 1 significant gap (inter-annotator agreement), 4 moderate gaps (SPLADE, ColBERT, Self-RAG, domain-adapted embedding), and 1 future direction (agentic RAG).


9. Why Generic Medical QA Benchmarks Don't Apply

A common critique of domain-specific RAG systems is the absence of evaluation against established medical QA benchmarks such as MedQA (Jin et al., 2021), PubMedQA (Jin et al., 2019), MedMCQA (Pal et al., 2022), or MMLU-Medical (Hendrycks et al., 2021). While these benchmarks are appropriate for clinical decision support systems and medical knowledge models, they are fundamentally misaligned with the ZOL system for several reasons:

9.1 Scope Mismatch: Hospital-Specific vs. General Medical Knowledge

The ZOL system operates on a closed corpus of approximately 1,000 hospital-specific documents (department pages, brochures, doctor profiles, patient guides). It does not contain — and is explicitly designed not to answer — general medical knowledge questions. Approximately 80% of questions in MedQA or PubMedQA would be entirely out-of-scope because they concern diagnoses, drug interactions, or clinical protocols that are not part of the ZOL website content.

For example, a MedQA question like "What is the first-line treatment for community-acquired pneumonia?" expects clinical guideline knowledge. The ZOL system would correctly respond with a safety refusal or a navigational redirect to the Pneumology department — which would be scored as "incorrect" by MedQA metrics despite being the appropriate system behaviour.

9.2 Task Mismatch: Navigation vs. Clinical Decision Support

The ZOL system is navigational and informational, not a clinical decision support tool. Its primary task is answering queries like:

  • "Which department handles heart problems?" (entity lookup)
  • "How do I prepare for a colonoscopy?" (patient guide retrieval)
  • "Which doctors work at the Oncology department?" (entity relationship traversal)

These tasks have no equivalent in MedQA, PubMedQA, or MMLU-Medical, which test medical reasoning, evidence synthesis, and clinical judgement. Evaluating a hospital navigation system on clinical reasoning benchmarks is analogous to evaluating a library catalogue system on reading comprehension — it measures the wrong capability.

9.3 Language and Domain Specificity

The ZOL system operates primarily in Dutch on Belgian hospital content. None of the major medical QA benchmarks provide Dutch-language evaluation sets. While MIRACL (Zhang et al., 2023) includes Dutch retrieval tasks, it does not cover medical QA. The closest applicable benchmark is MTEB-NL for retrieval quality (Layer 1 of our evaluation), which evaluates the embedding model on general Dutch information retrieval.

9.4 The Appropriate Evaluation Strategy

Thakur et al. (2021) demonstrated that domain-specific evaluation is essential because generic benchmarks systematically overestimate or underestimate system quality for specialised use cases. Following this principle, the ZOL system uses a three-layer evaluation architecture (described in Section 10) that combines external benchmarks for component validation with domain-specific benchmarks for system-level quality measurement.


10. Three-Layer Evaluation Architecture

The ZOL evaluation framework follows a layered approach that addresses the limitations of generic benchmarks while maintaining scientific rigour through external reference points.

10.1 Layer 1: MTEB-NL / BEIR-NL — Embedding Model Validation

Purpose: Validate the embedding model choice (BGE-M3) against published Dutch retrieval leaderboards.

Method: The mteb_nl_benchmark.py runner evaluates BGE-M3 on standardised MTEB retrieval tasks including Dutch content. This provides an external, reproducible reference point for the embedding model's retrieval capability independent of the ZOL domain.

Key metrics: NDCG@10, MRR, Recall@100 — aggregated across available Dutch retrieval tasks.

Measured result: BGE-M3 achieves an MTEB-NL retrieval score of 60.0, positioning it as the strongest open-source multilingual model available for local inference via Ollama.

Limitation: General Dutch retrieval does not measure medical domain performance. This layer validates the foundation (embedding quality) but not the application (hospital search).

10.2 Layer 2: Domain-Specific ZOL Retrieval Benchmark

Purpose: Measure retrieval quality for hospital-specific queries using a curated test set with known expected source URLs.

Method: A benchmark of 50 queries across 10 query types evaluates whether the retrieval pipeline (vector + BM25 + knowledge graph, fused via RRF, reranked via cross-encoder) returns the correct hospital pages for each query. Query types include:

Type | Count | Description
entity_lookup | 5 | Doctor, department, campus lookups
condition_navigation | 5 | Symptom/condition to department routing
multi_hop | 5 | Multi-entity relationship chains
practical_info | 5 | Visiting hours, parking, appointments
rare_condition | 5 | Less common diseases and conditions
treatment_lookup | 5 | Treatment and procedure information
multilingual | 5 | Queries in English, French, Turkish
typo_tolerance | 5 | Queries with common spelling errors
complex_multi_hop | 5 | 3-4 hop chains across multiple entities
disambiguation | 5 | Ambiguous queries mapping to multiple departments

Key metrics: Recall@5, Recall@10, MRR, NDCG@10, Precision@5 — computed per query type and aggregated.
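For reference, the URL-level metrics listed above can be computed per query as sketched here, assuming binary relevance (a retrieved URL either is or is not in the expected set) and no duplicate URLs in the retrieved list. The function names are illustrative and are not taken from the benchmark code; MRR is the mean of `reciprocal_rank` over all queries.

```python
import math
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Share of the expected URLs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant URL, or 0.0 if none was retrieved."""
    for rank, url in enumerate(retrieved, start=1):
        if url in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: gain 1 for a relevant URL, discounted by log2(rank + 1)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, url in enumerate(retrieved[:k], start=1) if url in relevant)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

With exact URL matching, a retrieved sub-page never counts as a hit for an expected department page, which is the granularity problem noted for NDCG@5 and MRR in Section 6.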

Why this matters: This layer measures what generic benchmarks cannot — whether the system retrieves the right hospital content for the specific types of queries real patients ask. The per-type breakdown identifies which query categories need improvement (e.g., rare conditions may have lower recall than entity lookups).

10.3 Layer 3: End-to-End RAG Evaluation

Purpose: Measure full pipeline quality from query to generated answer, including retrieval, context assembly, generation, citation, and safety.

Method: A golden evaluation set of 271 questions across 21 intent categories is evaluated using:

  • Entity recall: Do generated answers mention the correct entities (departments, doctors, conditions)?
  • Pass rate: Does the answer correctly address the query intent?
  • DeepEval metrics: FaithfulnessMetric, AnswerRelevancyMetric for LLM-as-judge quality
  • Safety refusal rate: Are medical advice requests correctly refused?
  • Citation accuracy: Do source citations correspond to actual retrieved content?
  • Bootstrap confidence intervals: 95% CIs via 10,000 bootstrap iterations (W1-1)

Measured results: 98.9% pass rate, 0.936 entity recall, zero safety incidents.

10.4 How the Layers Complement Each Other

The three layers form a pyramid of evaluation scope:

Layer 3 (apex): End-to-End RAG (302 golden questions, 21 intents)
    Full pipeline: retrieval → generation
    Entity recall, pass rate, safety

Layer 2 (middle): Domain Retrieval Benchmark (50 queries, 10 types, URL-level matching)
    Retrieval pipeline isolation test
    Recall@k, MRR, NDCG@10 per query type

Layer 1 (base): MTEB-NL External Benchmark (standard Dutch retrieval tasks, published scores)
    Embedding model validation
    External reproducibility, model comparison

  • Layer 1 validates the component (embedding model) against external baselines
  • Layer 2 validates the retrieval system against domain-specific ground truth
  • Layer 3 validates the complete pipeline including generation quality and safety

A failure at Layer 1 (poor embedding model) would propagate to Layers 2 and 3. A failure at Layer 2 (retrieval misses) might not appear at Layer 3 if the LLM compensates — which is why isolated retrieval measurement is essential. A failure at Layer 3 (poor generation despite good retrieval) indicates generation-layer issues rather than retrieval problems.
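
The triage logic in the paragraph above can be written down as a small decision function. This is only an illustration of the reasoning: the function name and the pass/fail inputs are hypothetical, and how each layer's "ok" is decided (for example, a threshold on Recall@10) is left open.

```python
def diagnose(embedding_ok: bool, retrieval_ok: bool, end_to_end_ok: bool) -> str:
    """Map per-layer pass/fail results to the most likely failure location."""
    if not embedding_ok:
        # A weak embedding model is expected to degrade layers 2 and 3 as well.
        return "layer1: embedding model"
    if not retrieval_ok and end_to_end_ok:
        # The LLM is compensating; only the isolated retrieval test reveals this.
        return "layer2: retrieval misses masked by generation"
    if not retrieval_ok:
        return "layer2: retrieval pipeline"
    if not end_to_end_ok:
        return "layer3: generation despite good retrieval"
    return "ok"
```

The second branch is the case that motivates Layer 2: without an isolated retrieval benchmark, a passing end-to-end score would hide the retrieval misses.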

This layered approach follows the recommendation in the BEIR methodology (Thakur et al., 2021) that retrieval evaluation be separated from downstream task evaluation, and extends it with domain-specific adaptation.


11. Is This a Best-in-Class Architecture?

What "Best-in-Class" Means

For a production medical RAG system deployed in a hospital context, best-in-class means:

  1. Retrieval quality comparable to or exceeding published benchmarks for multilingual medical retrieval
  2. Safety guarantees validated through systematic adversarial testing
  3. Evaluation rigour sufficient for academic publication or regulatory submission
  4. Operational robustness with graceful degradation, monitoring, and alerting
  5. Architectural alignment with current SOTA RAG patterns documented in peer-reviewed surveys

Honest Assessment

The ZOL system is best-in-class for its operational constraints (local inference, single-instance deployment, zero external API dependencies for embeddings, Dutch language requirement). Within these constraints, the architecture makes excellent decisions:

  • BGE-M3 is the strongest available model on Ollama with Dutch support
  • The 12-layer safety architecture exceeds most published medical RAG systems
  • Contextual embeddings with page summaries implement the full Anthropic pattern
  • The frozen taxonomy approach prevents the "noisy graph" problem that plagues naive KG-RAG systems
  • The 5-tier LLM routing efficiently allocates model capacity

The gap to absolute best-in-class has narrowed significantly following the Wave 1-4 improvements. The most significant remaining gaps are:

| Gap | Severity | Impact |
| --- | --- | --- |
| No user study | Critical | Improved metrics ≠ improved patient experience |
| No inter-annotator agreement | Significant | Ground truth quality unvalidated |
| No learned sparse retrieval (SPLADE) | Medium | 5-15% potential retrieval improvement |
| No ColBERT/late interaction | Medium | Under-utilising BGE-M3 multi-vector capability |
| No Self-RAG | Medium | No self-critique during generation |
| No graph reasoning | Medium | Limited to lookup; cannot discover indirect relationships |
| No domain-adapted embedding | Medium | General-purpose model for specialised domain |

Previously critical gaps now addressed:

  • No external benchmark evaluation → MTEB-NL/BEIR-NL evaluated (W1-4)
  • No attribution verification → Implemented (W1-3)
  • No context filtering → FILCO implemented, feature-flagged (W2-1)
  • No adversarial red-teaming → 40-case harness (W3-1) + guardrails model (W3-2)
  • No statistical significance → Bootstrap confidence intervals (W1-1)
  • No CRAG → Ternary quality gate with refinement (W4-2)
  • No adaptive retrieval → Intent-driven strategy routing (W4-1)
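
To make the "ternary quality gate with refinement" concrete, the sketch below shows one plausible shape for such a gate. It is not the W4-2 implementation: the verdict names, the thresholds (0.7 and 0.3), the use of the best relevance score, and the action labels are all assumptions chosen for illustration.

```python
from enum import Enum


class Verdict(Enum):
    CORRECT = "correct"
    AMBIGUOUS = "ambiguous"
    INCORRECT = "incorrect"


def quality_gate(relevance_scores, upper=0.7, lower=0.3):
    """Classify a retrieval result set into one of three verdicts.

    relevance_scores: per-document scores in [0, 1] from some evaluator.
    The thresholds are hypothetical, not values from ADR-0038.
    """
    best = max(relevance_scores, default=0.0)
    if best >= upper:
        return Verdict.CORRECT
    if best < lower:
        return Verdict.INCORRECT
    return Verdict.AMBIGUOUS


# Hypothetical follow-up action for each verdict.
NEXT_ACTION = {
    Verdict.CORRECT: "generate from retrieved context",
    Verdict.AMBIGUOUS: "refine the query and retrieve again",
    Verdict.INCORRECT: "abstain or fall back",
}
```

The middle verdict is what distinguishes a ternary gate from a plain accept/reject filter: borderline retrievals trigger refinement instead of being either trusted or discarded.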

12. Roadmap to Best-in-Class

The following roadmap prioritises improvements by impact-to-effort ratio, grouped into three horizons:

Horizon 1: Evaluation Rigour — ✅ Completed (Wave 1)

| Item | Status | Reference |
| --- | --- | --- |
| External benchmark evaluation on MTEB-NL / BEIR-NL | ✅ W1-4 | BGE-M3 score: 60.0 |
| Bootstrap confidence intervals on evaluation metrics | ✅ W1-1 | 10,000 bootstrap iterations |
| Safety pipeline FP/FN rate measurement | ✅ W1-2 | safety_evaluation.py |
| Attribution verification service | ✅ W1-3 | NLI-based entailment scoring |
| Inter-annotator agreement for golden questions | ❌ Open | Requires 2 domain experts |

Horizon 2: Retrieval Quality — ✅ Mostly Completed (Waves 2 + 4)

| Item | Status | Reference |
| --- | --- | --- |
| FILCO-style context filtering at query time | ✅ W2-1 | Feature-flagged (context_filter_enabled) |
| Retrieval confidence scoring (calibrated abstention) | ✅ W2-2 | RetrievalConfidenceScorer |
| Domain-specific retrieval benchmark (200+ queries) | ✅ W2-3 | retrieval_benchmark.py |
| Adaptive retrieval (intent-driven strategy routing) | ✅ W4-1 | vector_only for navigational |
| Corrective RAG (ternary quality gate + refinement) | ✅ W4-2 | ADR-0038, feature-flagged |
| ColBERT retrieval mode in BGE-M3 | ❌ Open | Multi-vector retrieval |
| SNOMED CT integration (ADR-0016) | ✅ Phase C | 356K concepts, synonym expansion, FINDING_SITE routing, graph enrichment, 15/15 golden questions |

Horizon 3: Safety Hardening — ✅ Completed (Wave 3)

| Item | Status | Reference |
| --- | --- | --- |
| Systematic red-teaming harness | ✅ W3-1 | 40 adversarial test cases |
| Guardrails model (Llama Guard 3) | ✅ W3-2 | Feature-flagged (guardrails_enabled) |
| Anomaly threshold validation (ROC analysis) | ✅ W3-3 | anomaly_threshold_validation.py |

Horizon 4: Remaining Gaps (Future Work)

| Item | Effort | Impact | Reference |
| --- | --- | --- | --- |
| User study (50+ patients, task-based evaluation) | 4 weeks | Validates real-world impact | Standard medical AI |
| Inter-annotator agreement (recruit 2 domain experts) | 1 week | Validates ground truth quality | Tsatsaronis et al. (2015) |
| Learned sparse retrieval (SPLADE or equivalent) | 3 weeks | +5-15% BEIR improvement | Formal et al. (2022) |
| ColBERT/late interaction retrieval mode | 2 weeks | Multi-vector retrieval | Khattab & Zaharia (2020) |
| Self-RAG (self-critique during generation) | 4 weeks | Improved faithfulness | Asai et al. (2024) |
| Agentic RAG with dynamic strategy selection | 6 weeks | Full Modular RAG | Singh et al. (2025) |
| Domain-adapted embedding fine-tuning | 3 weeks | +3-5% retrieval quality | Thakur et al. (2021) |
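
For the inter-annotator agreement item, one common statistic for two annotators is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The roadmap does not say which statistic would be used, so the choice of kappa, the function name, and the example labels below are assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgements by two domain experts on four golden answers.
expert_1 = ["pass", "pass", "fail", "pass"]
expert_2 = ["pass", "fail", "fail", "pass"]
kappa = cohens_kappa(expert_1, expert_2)
```

In this toy example the experts agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore 0.5, which shows how much lower chance-corrected agreement can be than raw agreement.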

13. Conclusion

The ZOL Intelligent Search system represents a mature, production-grade Advanced RAG architecture with significant Modular RAG characteristics, making consistently good engineering trade-offs within its operational constraints. The 10-stage retrieval pipeline (now including adaptive strategy selection and CRAG quality gate), 12-layer safety architecture (now validated via red-teaming and guardrails model), and frozen taxonomy approach are architecturally sound and well-documented through 38 Architecture Decision Records.

Following the Wave 1-4 improvement programme, the system has closed 8 of the 11 originally identified gaps:

| Wave | Implemented | Key Achievement |
| --- | --- | --- |
| W1 | Evaluation rigour | External benchmarks, bootstrap CIs, attribution verification, safety FP/FN |
| W2 | Retrieval quality | FILCO context filtering, confidence scoring, domain-specific benchmark |
| W3 | Safety hardening | Red-teaming harness, Llama Guard guardrails, ROC threshold validation |
| W4 | Adaptive pipeline | Intent-driven strategy routing, CRAG ternary quality gate |

The system now implements 18 of 26 identified SOTA techniques (up from 10/18 in the initial assessment). The comparative table score has improved from 55% to 69% coverage.

The remaining critical path to demonstrable best-in-class status no longer runs primarily through internal evaluation; it requires external validation: a user study with real patients (critical), inter-annotator agreement for golden questions (significant), and the remaining retrieval and generation improvements (SPLADE, ColBERT, Self-RAG), which represent moderate engineering effort with demonstrated academic impact. The system has crossed the threshold from "appears strong" to "architecturally strong with partial empirical evidence"; the next step is "demonstrably strong through external validation."


References