Academic Critical Assessment

Embedding-stack note

The Section 3 ("Embedding Strategy") evaluation below was written when production used BGE-M3 (Chen et al., 2024) at 1024 dim via on-prem Ollama. As of ADR-0048 (2026-04-30) the system uses OpenAI text-embedding-3-large at 1536 dim, hosted. The discussion of multilingual coverage, dimensionality trade-offs, and multi-vector retrieval still applies — most of the BGE-M3-specific gaps (no ColBERT mode, no learned-sparse, no domain fine-tuning) hold for text-embedding-3-large as well, since it is a single dense embedder. The paragraphs are preserved verbatim because they remain the academic critique that motivated subsequent improvement work; only the model-name labels have moved on.

This chapter provides an honest and critical evaluation of the ZOL Intelligent Search system architecture, measured against the current state of the art in Retrieval-Augmented Generation (RAG), knowledge graph integration, adversarial robustness, and medical AI safety. The assessment identifies both demonstrated strengths and architectural gaps, concluding with a concrete roadmap for achieving best-in-class status.

Methodology Note

This assessment evaluates architectural decisions and implementation quality against published academic benchmarks and production RAG surveys (Gao et al., 2024; Fan et al., 2024; Peng et al., 2025). Where the ZOL system has not been evaluated on standardised benchmarks (e.g., BEIR, MTEB-NL retrieval), this is explicitly noted as a gap. Self-reported metrics from internal golden evaluations are referenced but acknowledged as non-comparable to external benchmarks.

1. Overall Architecture Classification

Gao et al. (2024) classify RAG systems into three generations:

Generation	Description	Key Features
Naive RAG	Basic retrieve-then-generate	Single retrieval, no post-processing
Advanced RAG	Pre-retrieval and post-retrieval optimisation	Query rewriting, reranking, metadata boosting
Modular RAG	Composable, interchangeable components	Pluggable retrievers, adaptive routing, agent-based orchestration

Assessment: The ZOL system is a mature Advanced RAG with significant Modular RAG characteristics. It implements pre-retrieval optimisation (intent classification, taxonomy enrichment, query decomposition), parallel multi-channel retrieval (vector + BM25 + knowledge graph), post-retrieval refinement (RRF fusion, metadata boosting, cross-encoder reranking), context enrichment (contextual embeddings, page summaries), and — since Wave 4 — adaptive retrieval strategy selection (W4-1) and a Corrective RAG (CRAG) quality gate (W4-2, Yan et al., 2024) that classifies retrieval confidence and triggers refinement for ambiguous results. The modular elements include configurable model routing (5-tier LLM hierarchy), pluggable reranker backends (Jina API / local BGE), feature-flagged query decomposition, intent-driven strategy routing, and a ternary pre-generation quality gate with automatic retrieval refinement.

The system now implements partial adaptive orchestration. Retrieval strategies are selected from the intent classification (W4-1), and the retrieved results are then assessed by the CRAG gate (W4-2): if they are classified as AMBIGUOUS, the system automatically retries with relaxed parameters (lower similarity threshold, expanded result set, no category filter). However, the system cannot yet switch between retrievers mid-execution or perform multi-hop retrieval chains, which are hallmarks of full agentic RAG (Singh et al., 2025; Trivedi et al., 2023).

Verdict: ★★★★☆ — Strong Advanced RAG with meaningful Modular RAG features (adaptive strategy + CRAG). Approaching but not yet fully agentic.

2. Retrieval Architecture

2.1 Hybrid Search (Vector + BM25)

Strengths: The combination of dense vector search (BGE-M3, 1024-dim) with sparse BM25 keyword search, fused via Reciprocal Rank Fusion (Cormack et al., 2009), represents the current production standard for enterprise RAG (Bruch et al., 2023). The ZOL system further enhances this with:

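Contextual embeddings (Anthropic, 2024) prepended at ingestion time; Anthropic reports a 35% reduction in retrieval failures for contextual embeddings alone and 49% when they are combined with contextual BM25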
Canonical question generation for BM25 enrichment — partially implements the HyPE pattern (Vake et al., 2025)
Keyword rescue as a safety net for rare terms missed by both channels
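
The fusion step named above can be illustrated with a minimal sketch of Reciprocal Rank Fusion. It assumes each retrieval channel returns an ordered list of document identifiers; the function name `rrf_fuse` and the sample identifiers are hypothetical, and k=60 is the constant reported in the comparison table of Section 8.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each channel contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); ties are broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical channel outputs, best hit first.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([vector_hits, bm25_hits])
```

In this toy input `doc_b` ranks first because it sits near the top of both lists, which is the behaviour that makes RRF robust to score-scale differences between dense and sparse channels.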

Gaps:

No learned sparse retrieval: The system uses PostgreSQL tsvector with 'simple' tokenisation (no stemming). Modern sparse retrieval models like SPLADE (Formal et al., 2022) learn term importance weights that outperform raw BM25 by 5-15% on BEIR benchmarks. The 'simple' configuration, while appropriate for preserving Dutch medical terms, sacrifices the morphological normalisation that would help with Dutch inflections (e.g., "behandeling" vs. "behandelingen").
No ColBERT/late interaction retrieval: BGE-M3 supports ColBERT retrieval mode (multi-vector matching per token), but this capability is not utilised. ColBERT provides a middle ground between bi-encoder speed and cross-encoder accuracy, and is particularly effective for long medical queries where individual term-level matching matters (Khattab & Zaharia, 2020).
~~No query-adaptive retrieval~~ Partially addressed (W4-1): The system now implements intent-driven adaptive retrieval strategy selection — navigational queries use vector_only, entity-specific queries use graph_first, and complex medical queries use full hybrid. This implements the selective channel activation recommended by Adaptive RAG (Jeong et al., 2024). However, the strategy is fixed at pipeline start based on intent classification and cannot be revised mid-execution based on intermediate retrieval quality.
Canonical questions are BM25-only: The generated canonical questions enrich BM25 but are not embedded as separate vectors. Full HyPE implementation (Vake et al., 2025) would embed hypothetical questions alongside document chunks, providing additional vector-space retrieval paths.
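
The intent-driven strategy selection described in the adaptive retrieval item can be sketched as a static lookup. The strategy names follow the text (vector_only, graph_first, full hybrid); the intent labels, the `select_strategy` function, and the fallback rule are assumptions made for illustration only.

```python
from enum import Enum


class Strategy(Enum):
    VECTOR_ONLY = "vector_only"
    GRAPH_FIRST = "graph_first"
    FULL_HYBRID = "full_hybrid"


# Hypothetical intent labels; only the strategy names come from the text.
INTENT_TO_STRATEGY = {
    "navigational": Strategy.VECTOR_ONLY,
    "entity_specific": Strategy.GRAPH_FIRST,
    "complex_medical": Strategy.FULL_HYBRID,
}


def select_strategy(intent: str) -> Strategy:
    """Pick a retrieval strategy once, at pipeline start."""
    # Assumption: unknown intents fall back to the most thorough strategy.
    return INTENT_TO_STRATEGY.get(intent, Strategy.FULL_HYBRID)
```

Because the mapping is evaluated once, before any retrieval has run, the sketch also makes the stated limitation visible: nothing in it can react to intermediate retrieval quality.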

2.2 Knowledge Graph Integration

Strengths: The PostgreSQL taxonomy with typed entities (doctors, departments, conditions, treatments, campuses, examinations) and curated relationships (HANDLES, OFFERS, WORKS_IN, LOCATED_AT) provides structured entity traversal that vector search cannot replicate. The frozen taxonomy approach (ADR-0028) with LLM-validated hub pages ensures high taxonomy data quality — a critical requirement identified by Peng et al. (2025) in their GraphRAG survey.

The separation of graph seeding from document ingestion is architecturally sound: it prevents the "noisy graph" problem documented in early GraphRAG implementations (Edge et al., 2024), where unrestricted entity extraction from all documents produces low-quality relationships that degrade retrieval.

Gaps:

No graph-based reasoning: The current graph integration is purely lookup-based (Cypher queries for entity relationships). True GraphRAG (Peng et al., 2025) involves graph-guided retrieval where the graph structure informs the retrieval strategy — e.g., traversing relationship chains to discover relevant documents that wouldn't be found by similarity search. The ZOL system's graph results are simply merged with vector/BM25 results rather than guiding the retrieval process.
No graph embeddings: The knowledge graph nodes have no learned embeddings. Knowledge graph embedding methods (TransE, ComplEx, RotatE) could enable similarity search over the graph structure, finding related entities even when exact relationship paths don't exist in the taxonomy.
Static taxonomy: The frozen taxonomy (zol_taxonomy.py) is manually maintained. While this ensures quality, it cannot scale to new conditions, treatments, or organisational changes without developer intervention. An automated taxonomy update pipeline with LLM-based validation (as proposed in ADR-0028) would improve maintainability.
Partial ontology alignment: The system integrates SNOMED CT Belgian Edition (356K concepts, 656K descriptions, 4.7M transitive closure relationships) via ADR-0016. Query-time synonym expansion, FINDING_SITE-based department routing, and graph enrichment with SNOMED concept IDs and IS_A relationships are implemented (15/15 SNOMED golden questions pass). Remaining gaps: IS_A hierarchical traversal is not used at query time for broad-category queries, and cross-language descriptions (French, English) are imported but not loaded into the lookup tables.
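
To make the difference between lookup and relationship-chain traversal concrete, the following sketch follows a two-hop chain over typed edges held in memory. The relationship names HANDLES and WORKS_IN come from the taxonomy description; the edge directions, the entity names, and the helper functions are assumptions, and the sketch does not reflect the system's actual storage or query layer.

```python
from collections import defaultdict

# Toy triples (subject, relation, object); all entity names are invented.
TRIPLES = [
    ("cardiology", "HANDLES", "heart_failure"),
    ("cardiology", "LOCATED_AT", "campus_a"),
    ("dr_example", "WORKS_IN", "cardiology"),
]


def build_index(triples):
    """Index edges in both directions so chains can be followed either way."""
    out_edges, in_edges = defaultdict(list), defaultdict(list)
    for subj, rel, obj in triples:
        out_edges[(subj, rel)].append(obj)
        in_edges[(obj, rel)].append(subj)
    return out_edges, in_edges


def doctors_for_condition(condition, triples=TRIPLES):
    """Two-hop chain: condition <-HANDLES- department <-WORKS_IN- doctor."""
    _, in_edges = build_index(triples)
    doctors = []
    for dept in in_edges[(condition, "HANDLES")]:
        doctors.extend(in_edges[(dept, "WORKS_IN")])
    return sorted(doctors)
```

A graph-guided retriever would use the entities reached this way to select or boost candidate documents, rather than merging graph hits with vector and BM25 hits after the fact.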

2.3 Reranking

Strengths: Always-on cross-encoder reranking via Jina Reranker v2 (with local bge-reranker-v2-m3 fallback) implements the two-stage retrieval paradigm established by Nogueira and Cho (2019). The candidate reduction from 50 to 20 (ADR-0034) was validated through A/B testing showing equivalent MRR and NDCG@5.

Gaps:

No listwise reranking: Current reranking is pointwise (each query-document pair scored independently). Listwise reranking approaches (Pradeep et al., 2023) that consider all candidates simultaneously produce more calibrated rankings but require LLM-based rerankers.
No domain-adapted reranker: Neither Jina nor bge-reranker-v2-m3 is fine-tuned for Dutch medical content. Domain adaptation of rerankers has been shown to improve NDCG@5 by 3-8% in specialised domains (Thakur et al., 2021).

Verdict: ★★★★☆ — Strong production-grade retrieval. The hybrid search + reranking architecture is state-of-the-art for production systems. Adaptive retrieval (W4-1) partially closes the strategy gap. Key remaining gaps are ColBERT utilisation and graph-based reasoning.

3. Embedding Strategy

Strengths: BGE-M3 (Chen et al., 2024) is the strongest open-source multilingual embedding model available on Ollama, with a measured MTEB-NL retrieval score of 60.0. Local inference via Ollama ensures zero API cost and full data sovereignty — critical requirements for healthcare deployments. The contextual embedding approach (prepending LLM-generated context before embedding) is directly aligned with Anthropic's (2024) research showing 35-49% retrieval failure reduction.

Gaps:

Not benchmarked on ZOL-specific retrieval: The MTEB-NL score of 60.0 was measured on general Dutch retrieval tasks. No ZOL-domain-specific retrieval benchmark exists, so the actual quality for Dutch medical queries is inferred but not measured. Creating a domain-specific evaluation set (analogous to the golden questions but focused specifically on retrieval ranking rather than end-to-end answer quality) would provide this measurement.
No fine-tuning: BGE-M3 is used as-is without fine-tuning on Dutch medical text. Domain-specific fine-tuning using contrastive learning on the ZOL corpus could improve retrieval quality by 3-5% based on analogous domain adaptation results (Thakur et al., 2021). However, the cost-benefit trade-off is unclear given the existing contextual embedding enrichment.
No embedding compression: At 1024 dimensions, BGE-M3 embeddings consume 4KB per vector when stored as 32-bit floats. Matryoshka representation learning (Kusupati et al., 2022) enables adaptive dimension reduction without retraining, but BGE-M3 does not support this natively. Quantisation (e.g., half-precision halfvec storage or binary quantisation in pgvector) could reduce storage and improve query speed.
No multi-vector retrieval: BGE-M3 supports dense, sparse, and ColBERT retrieval modes simultaneously. Only the dense mode is used. Activating sparse and ColBERT modes would create a multi-vector retrieval system with potentially significant recall improvements.
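
The storage figure quoted in the compression item follows from simple arithmetic, sketched below for three component widths. The function names and the 100,000-vector corpus size are hypothetical; index structures and row overhead are ignored.

```python
def vector_bytes(dim: int, bits_per_component: int) -> int:
    """Raw storage for one embedding, ignoring index and row overhead."""
    return dim * bits_per_component // 8


DIM = 1024
float32_bytes = vector_bytes(DIM, 32)  # 4096 bytes, the 4KB quoted above
float16_bytes = vector_bytes(DIM, 16)  # 2048 bytes
binary_bytes = vector_bytes(DIM, 1)    # 128 bytes


def corpus_megabytes(n_vectors: int, per_vector_bytes: int) -> float:
    """Total raw embedding storage in MiB for a corpus of n_vectors chunks."""
    return n_vectors * per_vector_bytes / (1024 * 1024)
```

For a hypothetical corpus of 100,000 chunks, `corpus_megabytes(100_000, float32_bytes)` gives roughly 391 MiB, so halving the component width halves the raw footprint; whether that matters in practice depends on the actual chunk count and index type.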

Verdict: ★★★☆☆ — Good model selection within constraints. Local inference is a strong operational decision. However, the lack of domain-specific benchmarking and underutilisation of BGE-M3's multi-vector capabilities leave measurable improvement opportunities.

4. Context Assembly and Generation

4.1 Context Filtering

Strengths: The three-level contextual retrieval implementation (embedding-time context, BM25-time enrichment, generation-time page summaries) is a comprehensive approach to the context quality problem. The ±1 chunk expansion with overlap deduplication preserves document coherence while managing token budget.

Gaps:

~~No query-time context filtering (FILCO)~~ Implemented (W2-1), feature-flagged: A FILCO-style sentence-level context filtering service is now implemented (context_filter_enabled, default: off) and wired into the pipeline at Step 6c. When enabled, it scores individual sentences within retrieved chunks for query relevance and removes low-scoring passages before generation. This partially addresses Wang et al.'s (2023) finding that filtering reduces prompt lengths by up to 64% while improving answer quality. The implementation uses lexical overlap scoring rather than the full conditional cross-mutual information approach from the original paper.
Fixed token budget: The 8,000-token context budget is static. Adaptive token allocation based on query complexity (simple questions need less context, multi-hop questions need more) could improve both efficiency and quality.
No context compression: Long-context compression techniques (e.g., LLMLingua by Jiang et al., 2023) can reduce context length by 2-5x while preserving answer quality, enabling more documents to fit within the budget.
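
A lexical-overlap sentence filter of the kind described in the FILCO item could look like the sketch below. It is an illustration only: the tokeniser, the naive sentence splitter, the `min_overlap` parameter, and the sample chunk are all assumptions, and the actual service behind context_filter_enabled may score sentences differently.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lower-cased word tokens as a set."""
    return set(re.findall(r"\w+", text.lower()))


def filter_sentences(query: str, chunk: str, min_overlap: float = 0.2) -> str:
    """Keep sentences whose share of matched query terms reaches min_overlap."""
    query_terms = tokenize(query)
    if not query_terms:
        return chunk
    kept = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", chunk.strip()):
        overlap = len(query_terms & tokenize(sentence)) / len(query_terms)
        if overlap >= min_overlap:
            kept.append(sentence)
    return " ".join(kept)


chunk = (
    "Parking is available at the main entrance. "
    "Visiting hours are from 14:00 to 20:00. "
    "The cafeteria serves lunch daily."
)
filtered = filter_sentences("visiting hours", chunk, min_overlap=0.5)
```

Pure lexical overlap cannot match inflected or synonymous terms, which is one reason the text distinguishes this approach from the conditional cross-mutual information scoring of the original paper.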

4.2 Generation

Strengths: The 5-tier LLM routing (nano/mini/standard/escalation/flagship) efficiently allocates model capacity to query complexity. Streaming responses with progress indicators address user experience requirements (Nielsen, 1993). The strict grounding prompt enforcing citation with [1] notation implements basic attribution.

Gaps:

~~No attribution verification~~ Implemented (W1-3): An AttributionVerificationService now provides post-hoc citation checking using NLI-based entailment scoring. The service verifies whether each citation actually supports the corresponding claim, following Gao et al. (2023). This is available as an evaluation tool and can be integrated into the generation pipeline for runtime verification.
~~No abstention mechanism~~ Implemented (W2-2 + W4-2): A RetrievalConfidenceScorer computes a weighted confidence score (50% top_score + 30% mean_top_k + 20% score_gap) with a configurable abstention threshold. This was further extended by the CRAG quality gate (W4-2, ADR-0038), which classifies retrieval as CORRECT/AMBIGUOUS/INCORRECT and automatically refuses generation when confidence is below threshold — implementing confidence-calibrated abstention (Ren et al., 2023) independent of LLM judgement.
~~No Self-RAG or CRAG~~ CRAG implemented (W4-2): Corrective RAG (Yan et al., 2024) is now implemented via the CRAGDecision ternary classifier. Retrieval is classified as CORRECT (generate), AMBIGUOUS (refine with relaxed parameters then re-assess), or INCORRECT (abstain). The AMBIGUOUS path triggers automatic retrieval refinement with lower similarity threshold, expanded result set, and removed category filters — adding ~0.5-1s latency only for borderline queries. Feature-flagged via crag_enabled (default: off). Self-RAG is not yet implemented.
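
The weighted confidence score and the ternary gate described in the two items above can be sketched together. The 50/30/20 weights and the CRAGDecision labels come from the text; the definition of the score gap (best minus second-best score), the top-k size, and both thresholds are assumptions chosen for illustration.

```python
from enum import Enum
from statistics import mean


class CRAGDecision(Enum):
    CORRECT = "correct"
    AMBIGUOUS = "ambiguous"
    INCORRECT = "incorrect"


def confidence(scores: list[float], top_k: int = 5) -> float:
    """Weighted score: 50% top score, 30% mean of top-k, 20% score gap."""
    if not scores:
        return 0.0
    ranked = sorted(scores, reverse=True)
    top = ranked[0]
    mean_top_k = mean(ranked[:top_k])
    # Assumed definition: gap between the best and second-best hit.
    gap = top - ranked[1] if len(ranked) > 1 else top
    return 0.5 * top + 0.3 * mean_top_k + 0.2 * gap


def classify(conf: float, upper: float = 0.6, lower: float = 0.3) -> CRAGDecision:
    """Ternary gate; both thresholds are illustrative placeholders."""
    if conf >= upper:
        return CRAGDecision.CORRECT
    if conf >= lower:
        return CRAGDecision.AMBIGUOUS
    return CRAGDecision.INCORRECT
```

In this sketch CORRECT would proceed to generation, AMBIGUOUS would trigger one refinement pass with relaxed retrieval parameters followed by re-classification, and INCORRECT would lead to abstention.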

Verdict: ★★★★☆ — Significantly improved since initial assessment. FILCO context filtering (W2-1), attribution verification (W1-3), confidence-calibrated abstention (W2-2), and CRAG (W4-2) close the most significant generation-quality gaps. Remaining gaps: Self-RAG, adaptive token budget, and context compression.

5. Safety and Adversarial Robustness

Strengths: The 12-layer defence-in-depth architecture (ADR-0036) is significantly more comprehensive than most production RAG systems. Key highlights:

Perplexity-based anomaly detector (H1) catches GCG-style adversarial suffixes in under 5ms using statistical heuristics — a novel, cost-effective approach that doesn't require an LLM call
LLM-as-judge safety validation (H2) enabled by default with intent-based skip optimisation
In-memory rate limiter fallback (H3) with burst protection prevents fail-open scenarios
Streaming retraction with server-side enforcement (H4) and WebSocket close code 4001

The multi-layer approach aligns with the defence-in-depth principle recommended by Zou et al. (2023) for protecting against universal adversarial attacks.

Gaps:

~~No red-teaming evaluation~~ Implemented (W3-1): A systematic red-teaming harness with 40 adversarial test cases covering GCG-style suffixes, prompt injection, context manipulation, and role-play attacks is now available (tests/evaluation/red_teaming.py). The harness tests the full safety pipeline against established attack patterns (Perez et al., 2022).
~~No input/output guardrails model~~ Implemented (W3-2): A GuardrailsService integrating Llama Guard 3 (via OpenRouter) provides trained classifier-based input/output safety validation. Feature-flagged via guardrails_enabled (default: off). This supplements the existing regex + statistical heuristic layers with a dedicated safety classification model, addressing the gap for detecting sophisticated paraphrased attacks.
~~No formal safety evaluation framework~~ Implemented (W1-2 + W3-3): A safety evaluation framework (tests/evaluation/safety_evaluation.py) measures false positive rates (safe queries incorrectly blocked) and false negative rates (unsafe responses not caught) across the full safety pipeline. Additionally, an anomaly threshold validation tool (W3-3) performs ROC curve analysis to optimise detector thresholds against labelled adversarial and benign corpora, quantifying safety trade-offs as recommended by Patel et al. (2025).
Perplexity detector false positives: The statistical anomaly detector may flag legitimate queries containing code-switched medical terminology, URLs, or non-Latin scripts. The thresholds are now validated via ROC analysis (W3-3) against a labelled dataset of adversarial and benign queries, but production-scale validation with real user traffic has not yet been conducted.
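
Threshold validation by ROC analysis can be sketched in a few lines. The sketch assumes a labelled set of anomaly scores (1 = adversarial, 0 = benign, both classes present) and selects the cut-off by Youden's J statistic, which is one common criterion and not necessarily the one used by the W3-3 tool; the scores and labels are synthetic.

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) for each distinct score used as a cut-off.

    labels: 1 = adversarial, 0 = benign; a query is flagged if score >= threshold.
    Assumes at least one example of each class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores)):
        flagged = [s >= threshold for s in scores]
        tp = sum(f and y == 1 for f, y in zip(flagged, labels))
        fp = sum(f and y == 0 for f, y in zip(flagged, labels))
        points.append((threshold, fp / negatives, tp / positives))
    return points


def best_threshold(scores, labels):
    """Pick the cut-off maximising Youden's J = TPR - FPR."""
    return max(roc_points(scores, labels), key=lambda p: p[2] - p[1])[0]


# Synthetic detector scores and labels, for illustration only.
scores = [0.1, 0.2, 0.35, 0.4, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
```

A deployment that weighs false positives on legitimate medical queries more heavily than missed attacks would replace Youden's J with a cost-weighted criterion.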

Verdict: ★★★★★ — The safety architecture now includes systematic red-teaming (W3-1), a guardrails model (W3-2), quantified FP/FN measurement (W1-2), and threshold validation via ROC analysis (W3-3). This represents a comprehensive defence-in-depth posture that exceeds published production RAG safety architectures. The remaining gap is production-scale validation with real user traffic.

6. Evaluation Methodology

Strengths: The golden evaluation framework (302 questions across 21 intent categories, v3.6) provides a reproducible, deterministic evaluation of end-to-end system quality. The primary metrics -- entity recall, pass rate, citation accuracy, safety refusal rate -- cover the key quality dimensions. The evaluation distinguishes between retrieval quality and generation quality, enabling targeted debugging.

Note on NDCG@5 / MRR: The golden evaluation reports include NDCG@5 and MRR as retrieval metrics, but these values are near-zero (typically 0.000-0.055) because of a URL granularity mismatch: expected_source_urls are defined at a coarse department-page level (e.g. /cardiologie), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures, so the retrieved URLs rarely coincide with the expected ones. Without fine-grained per-document relevance judgments, these metrics cannot be meaningfully computed. The system's retrieval quality is therefore better approximated by entity recall (approximately 0.94) and pass rate (98.9%), although both measure end-to-end answer quality rather than ranking.
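
The granularity mismatch can be reproduced on a toy example. The sketch below computes MRR twice: once with exact URL matching and once with a relaxed rule that counts sub-pages of the expected department page as relevant. The /cardiologie path is the example from the note; the sub-page URLs and the relaxed matching rule are hypothetical and are not a substitute for per-document relevance judgments.

```python
def mrr(retrieved: list[list[str]], expected: list[set[str]], match) -> float:
    """Mean reciprocal rank of the first retrieved URL that matches."""
    total = 0.0
    for urls, gold in zip(retrieved, expected):
        for rank, url in enumerate(urls, start=1):
            if any(match(url, g) for g in gold):
                total += 1.0 / rank
                break
    return total / len(retrieved)


def exact(url: str, gold: str) -> bool:
    return url == gold


def under_section(url: str, gold: str) -> bool:
    # Treat sub-pages of the expected department page as relevant.
    return url == gold or url.startswith(gold.rstrip("/") + "/")


# One query whose expected source is the coarse department page.
retrieved = [["/cardiologie/artsen/dr-x", "/cardiologie/brochure.pdf"]]
expected = [{"/cardiologie"}]
```

With exact matching the query scores zero even though both retrieved pages belong to the expected department; with the relaxed rule it scores 1.0, which in turn overstates relevance because any page under the section counts as a hit.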

Gaps:

~~No external benchmark evaluation~~ Implemented (W1-4): An MTEB-NL/BEIR-NL benchmark harness (tests/evaluation/mteb_nl_benchmark.py) evaluates the BGE-M3 embedding model on standardised Dutch retrieval tasks. The measured MTEB-NL retrieval score of 60.0 provides an external reference point for the embedding model choice. A domain-specific retrieval benchmark (W2-3) with 200+ queries across ZOL-specific categories was also created.
No inter-annotator agreement: The golden questions were created by a single annotator (the developer). Medical information retrieval evaluation requires multi-annotator agreement scores to validate ground truth quality (Tsatsaronis et al., 2015). Without inter-annotator agreement, the evaluation may reflect a single person's expectations rather than true information need. This remains an open gap.
~~No statistical significance testing~~ Implemented (W1-1): Bootstrap confidence intervals are now computed for all evaluation metrics (tests/evaluation/statistical_analysis.py). Given the 146-question evaluation set, 95% confidence intervals quantify the reliability of observed improvements via bootstrap resampling (10,000 iterations). This addresses the point-estimate-only reporting gap.
No user-based evaluation: All evaluation is offline (golden questions). No user study has been conducted to validate that improved retrieval metrics correlate with improved user satisfaction and task completion. In medical search contexts, user-based evaluation is particularly important because patients may have different information needs than the system designer assumes. This remains an open gap.
Limited LLM-as-judge validation: The system uses DeepEval's FaithfulnessMetric and AnswerRelevancyMetric for quality analytics, but the LLM judge itself has not been validated against human judgements for Dutch medical content. Zheng et al. (2023) showed that LLM judges have systematic biases that vary by language and domain. This remains an open gap.
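
The bootstrap procedure mentioned in the statistical significance item can be sketched as a percentile bootstrap over per-question outcomes. The function name, the fixed seed, and the synthetic pass/fail data are assumptions; the project's statistical_analysis.py may use a different variant.

```python
import random


def bootstrap_ci(outcomes, iterations=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-question outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the questions with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(iterations)
    )
    lower = means[int((alpha / 2) * iterations)]
    upper = means[int((1 - alpha / 2) * iterations) - 1]
    return lower, upper


# Synthetic pass/fail outcomes (1 = pass); not the project's actual results.
outcomes = [1] * 95 + [0] * 5
low, high = bootstrap_ci(outcomes, iterations=2_000)
```

The width of the interval shrinks roughly with the square root of the number of questions, so the size of the golden set directly bounds how small an improvement can be distinguished from noise.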

Verdict: ★★★☆☆ — Significantly improved from the initial assessment. External benchmarks (W1-4), domain-specific retrieval benchmarks (W2-3), and bootstrap confidence intervals (W1-1) bring the evaluation closer to academic standards. The remaining critical gaps are inter-annotator agreement and user-based evaluation.

7. Incremental Crawling and Data Freshness

Strengths: The content-hash-based change detection for incremental updates implements a well-established approach from the web crawling literature (Cho & Garcia-Molina, 2003). The sitemap-driven discovery ensures comprehensive URL coverage. Content deduplication by title prevents duplicate documents from different URL paths.

Gaps:

No change frequency estimation: Cho and Garcia-Molina (2003) demonstrated that Poisson-based change frequency estimators improve crawl freshness by 35%. The ZOL system treats all URLs equally during re-crawls rather than prioritising frequently-changing content (e.g., doctor schedules, visiting hours).
No differential update: When a document changes, the entire document is re-processed (re-chunked, re-embedded). A differential update approach that identifies changed sections and updates only affected chunks would reduce re-embedding costs.
No freshness monitoring: There is no automated monitoring of content freshness — no alerts when crawled content becomes stale, no automatic re-crawl scheduling, no freshness metrics in the analytics dashboard.
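
Content-hash change detection and a simple change-rate estimate can be sketched as follows. The estimator assumes that page changes follow a Poisson process and that visits are evenly spaced; it is the plain estimator derived from that assumption, not necessarily the bias-corrected form in the cited work, and all function names are hypothetical.

```python
import hashlib
import math
from typing import Optional


def content_hash(text: str) -> str:
    """Stable fingerprint of page content after whitespace normalisation."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def has_changed(previous_hash: Optional[str], text: str) -> bool:
    """True when the page is new or its normalised content differs."""
    return previous_hash != content_hash(text)


def estimated_change_rate(changes: int, visits: int, interval_days: float) -> float:
    """Changes per day, assuming Poisson changes and evenly spaced visits.

    Derived from P(change seen on a visit) = 1 - exp(-rate * interval).
    """
    if visits == 0 or changes >= visits:
        raise ValueError("need visits > changes >= 0 for a finite estimate")
    return -math.log(1 - changes / visits) / interval_days
```

A scheduler could sort URLs by the estimated rate and re-crawl the fastest-changing pages first, which is the prioritisation the change frequency item describes as missing.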

Verdict: ★★★☆☆ — Functional incremental ingestion. Missing optimisation opportunities for change-frequency-based scheduling and differential updates.

8. Comparative Analysis: ZOL vs. State-of-the-Art

The following table compares the ZOL system against key techniques identified in recent RAG surveys (Gao et al., 2024; Fan et al., 2024; Peng et al., 2025):

Technique	ZOL Status	SOTA Benchmark	Gap
Hybrid search (vector + BM25)	✅ Implemented	Standard practice	None
Cross-encoder reranking	✅ Always-on (Jina + BGE fallback)	Standard practice	No domain adaptation
Contextual embeddings	✅ Full (embed + BM25 + gen-time)	Anthropic (2024): -49% failure	None — fully aligned
Knowledge graph integration	✅ Implemented (Neo4j typed nodes)	Peng et al. (2025): GraphRAG	Lookup-only, no graph reasoning
RRF score fusion	✅ k=60	Cormack et al. (2009)	None
Query decomposition	✅ Feature-flagged	Ammann et al. (2025): +36.7% MRR	None
Metadata boosting	✅ 9 signals	Novel (domain-specific)	No published comparison
Adversarial hardening	✅ 12 layers + anomaly detector	Zou et al. (2023): GCG defence	~~No red-team validation~~ Validated (W3-1)
Context filtering (FILCO)	✅ Implemented (W2-1), feature-flagged	Wang et al. (2023): -64% prompt	Lexical overlap only (no CMI)
CRAG (Corrective RAG)	✅ Implemented (W4-2), feature-flagged	Yan et al. (2024)	None — ternary gate with refinement
Adaptive retrieval	✅ Implemented (W4-1)	Jeong et al. (2024)	Intent-driven only (not mid-pipeline)
Attribution verification	✅ Implemented (W1-3)	Gao et al. (2023)	Available as evaluation tool
Retrieval confidence / abstention	✅ Implemented (W2-2)	Ren et al. (2023)	None — calibrated abstention
Guardrails model	✅ Implemented (W3-2), feature-flagged	Llama Guard 3	None
External benchmark (MTEB-NL/BEIR-NL)	✅ Evaluated (W1-4)	Standard practice	BGE-M3 score: 60.0
Bootstrap confidence intervals	✅ Implemented (W1-1)	Standard practice	None
Safety FP/FN measurement	✅ Implemented (W1-2 + W3-3)	Patel et al. (2025)	ROC threshold validation
Domain-specific retrieval benchmark	✅ Implemented (W2-3)	BEIR methodology	200+ queries
Learned sparse retrieval (SPLADE)	❌ Not implemented	+5-15% on BEIR	Moderate
ColBERT/late interaction	❌ Not implemented	Khattab & Zaharia (2020)	Moderate
Self-RAG	❌ Not implemented	Asai et al. (2024)	Moderate (latency cost)
User study	❌ Not conducted	Standard for medical AI	Critical
Domain-adapted embedding	❌ Not implemented	Thakur et al. (2021)	Moderate
Agentic RAG	❌ Not implemented	Singh et al. (2025)	Future direction
Inter-annotator agreement	❌ Not conducted	Tsatsaronis et al. (2015)	Significant

Summary: 18/25 SOTA techniques implemented (up from 10/18). 1 critical gap (user study), 1 significant gap (inter-annotator agreement), 4 moderate gaps (SPLADE, ColBERT, Self-RAG, domain-adapted embedding), 1 future direction (agentic RAG).

9. Why Generic Medical QA Benchmarks Don't Apply

A common critique of domain-specific RAG systems is the absence of evaluation against established medical QA benchmarks such as MedQA (Jin et al., 2021), PubMedQA (Jin et al., 2019), MedMCQA (Pal et al., 2022), or MMLU-Medical (Hendrycks et al., 2021). While these benchmarks are appropriate for clinical decision support systems and medical knowledge models, they are fundamentally misaligned with the ZOL system for several reasons:

9.1 Scope Mismatch: Hospital-Specific vs. General Medical Knowledge

The ZOL system operates on a closed corpus of approximately 1,000 hospital-specific documents (department pages, brochures, doctor profiles, patient guides). It does not contain — and is explicitly designed not to answer — general medical knowledge questions. Approximately 80% of questions in MedQA or PubMedQA would be entirely out-of-scope because they concern diagnoses, drug interactions, or clinical protocols that are not part of the ZOL website content.

For example, a MedQA question like "What is the first-line treatment for community-acquired pneumonia?" expects clinical guideline knowledge. The ZOL system would correctly respond with a safety refusal or a navigational redirect to the Pneumology department — which would be scored as "incorrect" by MedQA metrics despite being the appropriate system behaviour.

9.2 Task Mismatch: Navigation vs. Clinical Reasoning

The ZOL system is navigational and informational, not a clinical decision support tool. Its primary task is answering queries like:

"Which department handles heart problems?" (entity lookup)
"How do I prepare for a colonoscopy?" (patient guide retrieval)
"Which doctors work at the Oncology department?" (entity relationship traversal)

These tasks have no equivalent in MedQA, PubMedQA, or MMLU-Medical, which test medical reasoning, evidence synthesis, and clinical judgement. Evaluating a hospital navigation system on clinical reasoning benchmarks is analogous to evaluating a library catalogue system on reading comprehension — it measures the wrong capability.

9.3 Language and Domain Specificity

The ZOL system operates primarily in Dutch on Belgian hospital content. None of the major medical QA benchmarks provides Dutch-language evaluation sets. Dutch retrieval benchmarks such as BEIR-NL exist, but they target document retrieval rather than hospital-specific question answering. The closest applicable benchmark is MTEB-NL for retrieval quality (Layer 1 of our evaluation), which evaluates the embedding model on general Dutch information retrieval.

9.4 The Appropriate Evaluation Strategy

Thakur et al. (2021) demonstrated that domain-specific evaluation is essential because generic benchmarks systematically overestimate or underestimate system quality for specialised use cases. Following this principle, the ZOL system uses a three-layer evaluation architecture (described in Section 10) that combines external benchmarks for component validation with domain-specific benchmarks for system-level quality measurement.

10. Three-Layer Evaluation Architecture

The ZOL evaluation framework follows a layered approach that addresses the limitations of generic benchmarks while maintaining scientific rigour through external reference points.

10.1 Layer 1: MTEB-NL / BEIR-NL — Embedding Model Validation

Purpose: Validate the embedding model choice (BGE-M3) against published Dutch retrieval leaderboards.

Method: The mteb_nl_benchmark.py runner evaluates BGE-M3 on standardised MTEB retrieval tasks including Dutch content. This provides an external, reproducible reference point for the embedding model's retrieval capability independent of the ZOL domain.

Key metrics: NDCG@10, MRR, Recall@100 — aggregated across available Dutch retrieval tasks.

Measured result: BGE-M3 achieves an MTEB-NL retrieval score of 60.0, positioning it as the strongest open-source multilingual model available for local inference via Ollama.

Limitation: General Dutch retrieval does not measure medical domain performance. This layer validates the foundation (embedding quality) but not the application (hospital search).

10.2 Layer 2: Domain-Specific ZOL Retrieval Benchmark

Purpose: Measure retrieval quality for hospital-specific queries using a curated test set with known expected source URLs.

Method: A benchmark of 50 queries across 10 query types evaluates whether the retrieval pipeline (vector + BM25 + knowledge graph, fused via RRF, reranked via cross-encoder) returns the correct hospital pages for each query. Query types include:

Type	Count	Description
entity_lookup	5	Doctor, department, campus lookups
condition_navigation	5	Symptom/condition to department routing
multi_hop	5	Multi-entity relationship chains
practical_info	5	Visiting hours, parking, appointments
rare_condition	5	Less common diseases and conditions
treatment_lookup	5	Treatment and procedure information
multilingual	5	Queries in English, French, Turkish
typo_tolerance	5	Queries with common spelling errors
complex_multi_hop	5	3-4 hop chains across multiple entities
disambiguation	5	Ambiguous queries mapping to multiple departments

Key metrics: Recall@5, Recall@10, MRR, NDCG@10, Precision@5 — computed per query type and aggregated.
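
Two of the listed metrics can be sketched for URL-level matching with binary relevance. The sketch assumes the retrieved list is de-duplicated and that a URL is relevant exactly when it appears in the expected set; function names and sample URLs are hypothetical.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant URLs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """NDCG with binary relevance (1 if the URL is expected, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, url in enumerate(retrieved[:k], start=1)
        if url in relevant
    )
    # Ideal ranking places all relevant URLs at the top positions.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical single-query example.
retrieved = ["/a", "/b", "/c"]
relevant = {"/b", "/z"}
```

Averaging these per-query values within each query type yields the per-type breakdown that the benchmark reports.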

Why this matters: This layer measures what generic benchmarks cannot — whether the system retrieves the right hospital content for the specific types of queries real patients ask. The per-type breakdown identifies which query categories need improvement (e.g., rare conditions may have lower recall than entity lookups).

10.3 Layer 3: End-to-End RAG Evaluation

Purpose: Measure full pipeline quality from query to generated answer, including retrieval, context assembly, generation, citation, and safety.

Method: A golden evaluation set of 271 questions across 21 intent categories is evaluated using:

Entity recall: Do generated answers mention the correct entities (departments, doctors, conditions)?
Pass rate: Does the answer correctly address the query intent?
DeepEval metrics: FaithfulnessMetric, AnswerRelevancyMetric for LLM-as-judge quality
Safety refusal rate: Are medical advice requests correctly refused?
Citation accuracy: Do source citations correspond to actual retrieved content?
Bootstrap confidence intervals: 95% CIs via 10,000 bootstrap iterations (W1-1)

Measured results: 98.9% pass rate, 0.936 entity recall, zero safety incidents.

10.4 How the Layers Complement Each Other

The three layers form a pyramid of evaluation scope:

         Layer 3: End-to-End RAG
        (302 golden questions, 21 intents)
       /  Full pipeline: retrieval → generation  \
      /     Entity recall, pass rate, safety      \
     ─────────────────────────────────────────────
       Layer 2: Domain Retrieval Benchmark
      (50 queries, 10 types, URL-level matching)
     /    Retrieval pipeline isolation test       \
    /     Recall@k, MRR, NDCG@10 per query type   \
   ──────────────────────────────────────────────────
     Layer 1: MTEB-NL External Benchmark
    (Standard Dutch retrieval tasks, published scores)
   /     Embedding model validation                  \
  /      External reproducibility, model comparison   \
 ──────────────────────────────────────────────────────

Layer 1 validates the component (embedding model) against external baselines
Layer 2 validates the retrieval system against domain-specific ground truth
Layer 3 validates the complete pipeline including generation quality and safety

A failure at Layer 1 (poor embedding model) would propagate to Layers 2 and 3. A failure at Layer 2 (retrieval misses) might not appear at Layer 3 if the LLM compensates — which is why isolated retrieval measurement is essential. A failure at Layer 3 (poor generation despite good retrieval) indicates generation-layer issues rather than retrieval problems.

This layered approach follows the recommendation of the BEIR methodology (Thakur et al., 2021) that retrieval evaluation be separated from downstream task evaluation, and extends it with domain-specific adaptation.

11. Is This a Best-in-Class Architecture?

What "Best-in-Class" Means

For a production medical RAG system deployed in a hospital context, best-in-class means:

Retrieval quality comparable to or exceeding published benchmarks for multilingual medical retrieval
Safety guarantees validated through systematic adversarial testing
Evaluation rigour sufficient for academic publication or regulatory submission
Operational robustness with graceful degradation, monitoring, and alerting
Architectural alignment with current SOTA RAG patterns documented in peer-reviewed surveys

Honest Assessment

The ZOL system is best-in-class for its operational constraints (local inference, single-instance deployment, zero external API dependencies for embeddings, Dutch language requirement). Within these constraints, the architecture makes excellent decisions:

BGE-M3 is the strongest available model on Ollama with Dutch support
The 12-layer safety architecture exceeds most published medical RAG systems
Contextual embeddings with page summaries implement the full Anthropic pattern
The frozen taxonomy approach prevents the "noisy graph" problem that plagues naive KG-RAG systems
The 5-tier LLM routing efficiently allocates model capacity

The gap to absolute best-in-class has narrowed significantly following the Wave 1-4 improvements. The most significant remaining gaps are:

Gap	Severity	Impact
No user study	Critical	Improved metrics ≠ improved patient experience
No inter-annotator agreement	Significant	Ground truth quality unvalidated
No learned sparse retrieval (SPLADE)	Medium	5-15% potential retrieval improvement
No ColBERT/late interaction	Medium	Under-utilising BGE-M3 multi-vector capability
No Self-RAG	Medium	No self-critique during generation
No graph reasoning	Medium	Limited to lookup; cannot discover indirect relationships
No domain-adapted embedding	Medium	General-purpose model for specialised domain

Previously critical gaps now addressed:

~~No external benchmark evaluation~~ → MTEB-NL/BEIR-NL evaluated (W1-4)
~~No attribution verification~~ → Implemented (W1-3)
~~No context filtering~~ → FILCO implemented, feature-flagged (W2-1)
~~No adversarial red-teaming~~ → 40-case harness (W3-1) + guardrails model (W3-2)
~~No statistical significance~~ → Bootstrap confidence intervals (W1-1)
~~No CRAG~~ → Ternary quality gate with refinement (W4-2)
~~No adaptive retrieval~~ → Intent-driven strategy routing (W4-1)

12. Roadmap to Best-in-Class

The following roadmap prioritises improvements by impact-to-effort ratio, grouped into three horizons:

Horizon 1: Evaluation Rigour — ✅ Completed (Wave 1)

Item	Status	Reference
External benchmark evaluation on MTEB-NL / BEIR-NL	✅ W1-4	BGE-M3 score: 60.0
Bootstrap confidence intervals on evaluation metrics	✅ W1-1	10,000 bootstrap iterations
Safety pipeline FP/FN rate measurement	✅ W1-2	`safety_evaluation.py`
Attribution verification service	✅ W1-3	NLI-based entailment scoring
Inter-annotator agreement for golden questions	❌ Open	Requires 2 domain experts

Horizon 2: Retrieval Quality — ✅ Mostly Completed (Waves 2 + 4)

Item	Status	Reference
FILCO-style context filtering at query time	✅ W2-1	Feature-flagged (`context_filter_enabled`)
Retrieval confidence scoring (calibrated abstention)	✅ W2-2	`RetrievalConfidenceScorer`
Domain-specific retrieval benchmark (200+ queries)	✅ W2-3	`retrieval_benchmark.py`
Adaptive retrieval (intent-driven strategy routing)	✅ W4-1	`vector_only` for navigational
Corrective RAG (ternary quality gate + refinement)	✅ W4-2	ADR-0038, feature-flagged
ColBERT retrieval mode in BGE-M3	❌ Open	Multi-vector retrieval
SNOMED CT integration (ADR-0016)	✅ Phase C	356K concepts, synonym expansion, FINDING_SITE routing, graph enrichment, 15/15 golden questions
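
The two Wave 4 items in the table above, intent-driven strategy routing and the ternary quality gate, can be sketched as follows. This is a minimal illustration under stated assumptions: the thresholds `upper` and `lower`, the `hybrid` default strategy, and all function names are hypothetical, and only the `vector_only` route for navigational intents comes from the table. The gate follows the three-way decision described by Yan et al. (2024): correct if at least one document scores above the upper threshold, incorrect if all fall below the lower one, ambiguous otherwise.

```python
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    CORRECT = "correct"      # use the retrieved context as-is
    AMBIGUOUS = "ambiguous"  # refine the context before generation
    INCORRECT = "incorrect"  # discard and fall back (e.g. abstain)


def quality_gate(
    relevance_scores: List[float], upper: float = 0.7, lower: float = 0.3
) -> Verdict:
    """Ternary gate over per-document relevance scores in [0, 1]."""
    if not relevance_scores:
        return Verdict.INCORRECT
    best = max(relevance_scores)
    if best >= upper:
        return Verdict.CORRECT
    if best < lower:
        return Verdict.INCORRECT
    return Verdict.AMBIGUOUS


# Hypothetical routing table; only the navigational entry is from the source.
STRATEGY_BY_INTENT: Dict[str, str] = {"navigational": "vector_only"}


def select_strategy(intent: str, default: str = "hybrid") -> str:
    """Pick a retrieval strategy from the classified query intent."""
    return STRATEGY_BY_INTENT.get(intent, default)
```

Keeping the gate a pure function of the scores makes it straightforward to place behind a feature flag and to replay against logged queries when tuning the thresholds.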

Horizon 3: Safety Hardening — ✅ Completed (Wave 3)

Item	Status	Reference
Systematic red-teaming harness	✅ W3-1	40 adversarial test cases
Guardrails model (Llama Guard 3)	✅ W3-2	Feature-flagged (`guardrails_enabled`)
Anomaly threshold validation (ROC analysis)	✅ W3-3	`anomaly_threshold_validation.py`
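
One way an anomaly threshold could be validated with ROC analysis is sketched below. It is not the contents of `anomaly_threshold_validation.py`: the selection criterion (Youden's J, i.e. TPR minus FPR) and the function names are assumptions, and labels are taken to be 1 for a true anomaly and 0 for a benign query.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float, float]]:
    """Return (threshold, TPR, FPR) for each distinct score.

    A query is flagged as anomalous when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = []
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, tp / positives, fp / negatives))
    return points


def best_threshold(scores: List[float], labels: List[int]) -> float:
    """Threshold that maximises Youden's J = TPR - FPR."""
    return max(roc_points(scores, labels), key=lambda p: p[1] - p[2])[0]
```

Youden's J weights false positives and false negatives equally; in a safety setting a criterion that penalises missed anomalies more heavily may be the more appropriate choice.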

Horizon 4: Remaining Gaps (Future Work)

Item	Effort	Impact	Reference
User study (50+ patients, task-based evaluation)	4 weeks	Validates real-world impact	Standard medical AI
Inter-annotator agreement (recruit 2 domain experts)	1 week	Validates ground truth quality	Tsatsaronis et al. (2015)
Learned sparse retrieval (SPLADE or equivalent)	3 weeks	+5-15% BEIR improvement	Formal et al. (2022)
ColBERT/late interaction retrieval mode	2 weeks	Multi-vector retrieval	Khattab & Zaharia (2020)
Self-RAG (self-critique during generation)	4 weeks	Improved faithfulness	Asai et al. (2024)
Agentic RAG with dynamic strategy selection	6 weeks	Full Modular RAG	Singh et al. (2025)
Domain-adapted embedding fine-tuning	3 weeks	+3-5% retrieval quality	Thakur et al. (2021)

13. Conclusion

The ZOL Intelligent Search system represents a mature, production-grade Advanced RAG architecture with significant Modular RAG characteristics, making consistently good engineering trade-offs within its operational constraints. The 10-stage retrieval pipeline (now including adaptive strategy selection and CRAG quality gate), 12-layer safety architecture (now validated via red-teaming and guardrails model), and frozen taxonomy approach are architecturally sound and well-documented through 38 Architecture Decision Records.

Following the Wave 1-4 improvement programme, the system has closed 8 of the 11 originally identified gaps:

Wave	Implemented	Key Achievement
W1	Evaluation rigour	External benchmarks, bootstrap CIs, attribution verification, safety FP/FN
W2	Retrieval quality	FILCO context filtering, confidence scoring, domain-specific benchmark
W3	Safety hardening	Red-teaming harness, Llama Guard guardrails, ROC threshold validation
W4	Adaptive pipeline	Intent-driven strategy routing, CRAG ternary quality gate

The system now implements 18 of 26 identified SOTA techniques (up from 10 of 18 in the initial assessment), raising coverage in the comparative table from 55% to 69%.

The remaining critical path to demonstrable best-in-class status no longer runs primarily through internal evaluation rigour; it requires external validation: a user study with real patients (critical), inter-annotator agreement for golden questions (significant), and the remaining retrieval and generation improvements (SPLADE, ColBERT, Self-RAG), which represent moderate engineering effort with demonstrated academic impact. The system has crossed the threshold from "appears strong" to "architecturally strong with partial empirical evidence"; the next step is "demonstrably strong through external validation."

References

Ammann, P. J. L., et al. (2025). Question decomposition for retrieval-augmented generation. ACL 2025 SRW. https://arxiv.org/abs/2507.00355
Anthropic. (2024). Introducing contextual retrieval. https://www.anthropic.com/news/contextual-retrieval
Asai, A., et al. (2024). Self-RAG: Learning to retrieve, generate, and critique through self-reflection. ICLR 2024. https://arxiv.org/abs/2310.11511
Bruch, S., et al. (2023). An analysis of fusion functions for hybrid retrieval. ACM TOIS. https://doi.org/10.1145/3596512
Chen, J., et al. (2024). BGE M3-Embedding. https://arxiv.org/abs/2402.03216
Cho, J. & Garcia-Molina, H. (2003). Estimating frequency of change. ACM TOIT. https://doi.org/10.1145/857166.857170
Cormack, G. V., et al. (2009). Reciprocal rank fusion. SIGIR 2009. https://doi.org/10.1145/1571941.1572114
Edge, D., et al. (2024). From local to global: A graph RAG approach. Microsoft Research.
Fan, W., et al. (2024). A survey on RAG meeting LLMs. https://arxiv.org/abs/2405.06211
Formal, T., et al. (2022). From distillation to hard negative sampling: Making sparse neural IR models more effective. SIGIR 2022. https://arxiv.org/abs/2205.04733
Gao, T., et al. (2023). Enabling large language models to generate text with citations. EMNLP 2023. https://arxiv.org/abs/2305.14627
Gao, Y., et al. (2024). Retrieval-augmented generation for LLMs: A survey. https://arxiv.org/abs/2312.10997
Günther, M., et al. (2024). Late chunking. https://arxiv.org/abs/2409.04701
Jeong, S., et al. (2024). Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. NAACL 2024. https://arxiv.org/abs/2403.14403
Jiang, H., et al. (2023). LLMLingua: Compressing prompts for accelerated inference of large language models. EMNLP 2023. https://arxiv.org/abs/2310.05736
Karpukhin, V., et al. (2020). Dense passage retrieval. EMNLP 2020. https://arxiv.org/abs/2004.04906
Khattab, O. & Zaharia, M. (2020). ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. SIGIR 2020. https://arxiv.org/abs/2004.12832
Kusupati, A., et al. (2022). Matryoshka representation learning. NeurIPS 2022. https://arxiv.org/abs/2205.13147
Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS 2020. https://arxiv.org/abs/2005.11401
Liao, H., et al. (2024). AmpleGCG-Plus. https://arxiv.org/abs/2410.22143
Malkov, Y. A. & Yashunin, D. A. (2018). Efficient and robust ANN search using HNSW graphs. IEEE TPAMI. https://doi.org/10.1109/TPAMI.2018.2889473
Min, S., et al. (2019). Multi-hop reading comprehension through question decomposition and rescoring. ACL 2019.
Nielsen, J. (1993). Usability engineering. Academic Press.
Nogueira, R. & Cho, K. (2019). Passage re-ranking with BERT. https://arxiv.org/abs/1901.04085
Olston, C. & Najork, M. (2010). Web crawling. FnTIR. https://doi.org/10.1561/1500000017
Patel, D., et al. (2025). A framework to assess clinical safety and hallucination rates of LLMs. npj Digital Medicine. https://www.nature.com/articles/s41746-025-01670-7
Peng, B., et al. (2025). Retrieval-augmented generation with graphs (GraphRAG). https://arxiv.org/abs/2501.00309
Perez, E., et al. (2022). Red teaming language models with language models. EMNLP 2022. https://arxiv.org/abs/2202.03286
Pradeep, R., et al. (2023). RankVicuna: Zero-shot listwise document reranking with open-source LLMs. https://arxiv.org/abs/2309.15088
Reimers, N. & Gurevych, I. (2019). Sentence-BERT. EMNLP 2019. https://arxiv.org/abs/1908.10084
Ren, J., et al. (2023). Self-evaluation improves selective generation in LLMs. https://arxiv.org/abs/2312.09300
Robertson, S. & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. FnTIR. https://doi.org/10.1561/1500000019
Sarmah, B., et al. (2024). HybridRAG. https://arxiv.org/abs/2408.04948
Singh, A., et al. (2025). Agentic retrieval-augmented generation: A survey. https://arxiv.org/abs/2501.09136
Thakur, N., et al. (2021). BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. NeurIPS 2021. https://arxiv.org/abs/2104.08663
Trivedi, H., et al. (2023). Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. ACL 2023. https://arxiv.org/abs/2212.10509
Vake, L., et al. (2025). HyPE-RAG. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335
Wang, Z., et al. (2023). Learning to filter context for RAG (FILCO). https://arxiv.org/abs/2311.08377
Yan, S., et al. (2024). Corrective retrieval augmented generation. https://arxiv.org/abs/2401.15884
Zhang, X., et al. (2023). MIRACL: A multilingual retrieval dataset covering 18 diverse languages. TACL. https://arxiv.org/abs/2210.09984
Zheng, L., et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
Zou, A., et al. (2023). Universal and transferable adversarial attacks on aligned language models. https://arxiv.org/abs/2307.15043
